spark-submit.sh script
You can use the provided spark-submit.sh script to launch and manage your Apache Spark applications from a client machine. This script recognizes a subset of the configuration properties used by the spark-submit script provided by Apache Spark. It also introduces several additional commands and environment variables that are specific to the management of Spark applications within Db2® Warehouse.
Restrictions
The spark-submit.sh script for Db2 Warehouse:
- Can be used only on Linux® and macOS operating systems
- Can be used only to submit Scala, Java™, R, and Python applications
- Does not support SSL certificate verification (certificates are ignored and no peer verification takes place)
Prerequisite
To use the spark-submit.sh script, you must first download and install the cURL command-line tool. For more information, see Required REST tooling.
Before you begin
Before you begin:
- Download the spark-submit.sh script from the web console.
- Enter one or more of the following export commands to set environment variables that simplify the use of spark-submit.sh:
export DASHDBURL="https://hostname:8443"   # The URL of your Db2 Warehouse web console.
export DASHDBUSER=user-name                # Your user name.
export DASHDBPASS=password                 # Your password.
export DASHDBJSONOUT=YES|NO                # Whether output is to be returned in JSON (YES) or readable (NO) format.
Setting these environment variables eliminates the need for you to specify the same information each time you use spark-submit.sh.
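For illustration, a typical setup on a client machine might look like the following; the host name, user name, and password shown are placeholders, not real values:
export DASHDBURL="https://myhost.example.com:8443"
export DASHDBUSER=user42
export DASHDBPASS=mypassword
./spark-submit.sh --cluster-status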
Syntax
Description
- file_name
- The name of the file that contains the application code that is to be submitted.
- arguments
- Specify any arguments that are required as input for the application that is being run.
- --class
- For application code that is written in Java or Scala, this option specifies the name of the main class.
- --jars
- For application code that is written in Java or Scala, this option specifies a comma-separated list of any .jar files that are used by the application. These files must be in the $HOME/spark/apps directory.
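- For example, to submit a Scala application whose code also uses a library file named helper.jar (the file names and class name here are placeholders; both .jar files must already be in $HOME/spark/apps):
./spark-submit.sh subdir6/cool.jar --class c.myclass --jars helper.jar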
- --py-files
- For application code that is written in Python, this option specifies a comma-separated list of any .py, .zip, or .egg files that are used by the application. These files must be in the $HOME/spark/apps directory.
- --name
- This option specifies the name that is to be assigned to the application that is being launched. If this option is not specified, the name is set to the class name (for a Java or Scala application) or file name (for a Python or R application).
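- For example, to assign the name MyReadJob (a placeholder) to a Scala application, following the same invocation pattern as the other examples in this topic:
./spark-submit.sh subdir6/cool.jar --class c.myclass --name MyReadJob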
- --loc
- This option specifies the location of the file that contains the application code that is to be submitted:
- host
- The file is located on the Db2 Warehouse host system. Any path information specified in the file name indicates the path, relative to the $HOME/spark/apps directory, to the file. This is the default.
- client
- The file is located on your client system. Any path information specified in the file name indicates the path, relative to the current directory, to the file. The file is automatically deployed to the $HOME/spark/apps/temp directory before being submitted, and any file with the same name that is already in that directory is overwritten.
- For example, if the file is $HOME/spark/apps/subdir6/cool.jar on the host:
./spark-submit.sh --class c.myclass subdir6/cool.jar --loc host
- If the file is ./subdir6/cool.jar on the client:
./spark-submit.sh --class c.myclass subdir6/cool.jar --loc client
- --master
- If the Db2 Warehouse URL to which the application is to be submitted is different from the Db2 Warehouse URL that is currently set, use this option to specify the new URL. If the application is to run in a local Spark context that has a single worker thread, specify local.
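- For example, to run the ReadExample.py sample in a local Spark context with a single worker thread (this invocation is illustrative):
./spark-submit.sh ReadExample.py --master local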
- --load-samples
- This option loads the sample Spark application code contained in the following files into your $HOME/spark/apps directory:
idax_examples.jar
ReadExample.py
ReadExampleJson.py
ReadWriteExampleKMeans.py
ReadWriteExampleKMeansJson.py
ExceptionExample.py
SqlPredicateExample.py
example_utilities.egg
ReadExample.R
ReadExampleJson.R
ReadWriteExampleKMeans.R
ReadWriteExampleKMeansJson.R
ExceptionExample.R
SqlPredicateExample.R
example_utilities.R
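- For example, to load the samples and then confirm that they were copied (this sequence assumes that --load-samples takes no further arguments):
./spark-submit.sh --load-samples
./spark-submit.sh --list-files apps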
- --upload-file
- This option uploads the specified file from the specified source directory on your client system
to the directory that corresponds to the specified option:
- apps
- Upload to the $HOME/spark/apps directory. Use this target for files that are to be available to a particular application. This is the default option.
- defaultlibs
- Upload to the $HOME/spark/defaultlibs directory. Use this target for files that are to be available to all of your applications.
- globallibs
- Upload to the /globallibs directory. This option requires administrator access authority. Use this target for files that are to be available to all applications for all users.
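- For example, to upload a file named my_utilities.egg (a placeholder) from the current directory on your client system to your $HOME/spark/apps directory; the argument order shown here, with the target keyword followed by the file name, is an assumption:
./spark-submit.sh --upload-file apps my_utilities.egg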
- --user
- An administrator can issue some commands on behalf of another user, for example to upload files to or download or delete files from that user's $HOME/spark/apps or $HOME/spark/defaultlibs directories. This option specifies the name of the user on whose behalf a command is being issued.
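- For example, an administrator might list the files in the $HOME/spark/apps directory of the user user42 (a placeholder); combining --user with another command in this way is illustrative:
./spark-submit.sh --list-files apps --user user42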
- --download-file
- This option downloads the specified file to the current directory on your client system from the
directory that corresponds to the specified option:
- apps
- Download from the $HOME/spark/apps directory. Use this target for files that contain application code that you want to submit to Spark. This is the default option.
- defaultlibs
- Download from the $HOME/spark/defaultlibs directory. Use this target for files that are to be available to all of your applications.
- globallibs
- Download from the /globallibs directory. This option requires administrator access authority. Use this target for files that are to be available to all applications for all users.
- --dir
- The name of an existing directory on your client system into which a file is to be downloaded,
for example:
- --dir target1
- The target directory is a subdirectory of the current directory.
- --dir /target1
- The target directory is a subdirectory of the root directory.
- --dir ../target1
- The target directory is at the same level as the current directory.
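- For example, to download the ReadExample.py file from your $HOME/spark/apps directory into the target1 subdirectory of the current directory; the argument order, with the target keyword followed by the file name, is an assumption:
./spark-submit.sh --download-file apps ReadExample.py --dir target1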
- --list-files
- This option lists the files in the directory that corresponds to the specified option:
- apps
- List files in the $HOME/spark/apps directory. This is the default option.
- defaultlibs
- List files in the $HOME/spark/defaultlibs directory.
- globallibs
- List files in the /globallibs directory. This option requires administrator access authority.
- --delete-file
- This option deletes the specified file from the directory that corresponds to the specified option:
- apps
- Delete from the $HOME/spark/apps directory. This is the default option.
- defaultlibs
- Delete from the $HOME/spark/defaultlibs directory.
- globallibs
- Delete from the /globallibs directory. This option requires administrator access authority.
- --cluster-status
- This option retrieves information about the status of your Spark cluster, such as the number of applications that are currently running in it.
- --app-status
- This option retrieves information about the status of the application with the specified submission ID.
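- For example, to check the status of the application with submission ID 20160815210608126000 (the ID is the one used in the examples later in this topic), following the same pattern as the --kill command:
./spark-submit.sh --app-status 20160815210608126000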
- --list-apps
- This option retrieves information about all applications that are currently running or that ran since the cluster was last started.
- --download-cluster-logs
- This option retrieves the standard error and standard output logs of the master and worker processes of your Spark cluster.
- --download-app-logs
- This option retrieves all the log files for the application with the specified submission ID. If no submission ID is specified, the log files of the most recently run application are retrieved.
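- For example, to retrieve the log files of the application with submission ID 20160815210608126000 (illustrative; the ID is the one used in the examples later in this topic):
./spark-submit.sh --download-app-logs 20160815210608126000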
- --kill
- This option cancels the running application with the specified submission ID.
- --jsonout
- This option specifies that the spark-submit.sh script is to display its output in JSON format. This option overrides the setting of the DASHDBJSONOUT environment variable.
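- For example, to display the cluster status in JSON format regardless of the DASHDBJSONOUT setting:
./spark-submit.sh --cluster-status --jsonout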
- --display-cluster-log
- This option displays the contents of the cluster log file of the indicated type:
- out
- Standard output log file.
- err
- Standard error log file.
- master
- Log file of the master node of the cluster.
- worker
- Log file of the worker node with the specified IP address.
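- For example, to display the standard error log of the cluster (the placement of the log type after the option is an assumption):
./spark-submit.sh --display-cluster-log err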
- --display-app-log
- This option displays, for the application with the specified submission ID, the contents of the
log file of the indicated type:
- app
- Application log file.
- out
- Standard output log file.
- err
- Standard error log file.
- info
- Information log file containing a return code, messages, and an exception log from the application in JSON format.
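- For example, to display the information log of the application with submission ID 20160815210608126000 (the order of the log type and submission ID arguments is an assumption):
./spark-submit.sh --display-app-log info 20160815210608126000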
- --webui-url
- This option retrieves the URL of the Spark Web User Interface (UI), which you can use to monitor your cluster. The URL has the form:
http://dashDB-hostname:port
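- For example (assuming the option takes no further arguments):
./spark-submit.sh --webui-url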
- --env
- This option retrieves the settings of the following environment variables:
- DASHDBURL
- Sets the URL of the Db2 Warehouse web console.
- DASHDBUSER
- Sets the user name used to log in to the Db2 Warehouse web console.
- DASHDBPASS
- Sets the password used to log in to the Db2 Warehouse web console.
- DASHDBJSONOUT
- Specifies whether output is to be returned in JSON (YES) or readable (NO) format.
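- For example, to display the current settings (assuming the option takes no further arguments):
./spark-submit.sh --env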
- --version
- This option displays the version of the spark-submit.sh script and the build level of the Db2 Warehouse system that you are using.
- --help
- This option displays a description of the syntax of the spark-submit.sh script.
Output formats
The DASHDBJSONOUT environment variable specifies the default format for output
that is returned by the spark-submit.sh script:
- DASHDBJSONOUT=YES
- Output is returned in JSON format. This format is suitable for processing by other programs. For example, the output from a --cluster-status command looks similar to this:
{"statusDesc":"Cluster is running.","resultCode":200,"clusters":[{"running_jobs":1,"monitoring_url":"http:\/\/9.152.63.165:25005","username":"user42"}],"username":"user42","status":"running"}
- DASHDBJSONOUT=NO
- Output is returned in readable format. This is the default. For example, the output from a --cluster-status command looks similar to this:
status: Running
statusDesc: Cluster is running.
running jobs: 1
Examples
- Launch an application written in Scala based on the application code in the idax_examples.jar file, which is located in the jars subdirectory of the current directory on the client system:
./spark-submit.sh jars/idax_examples.jar --loc client --class com.ibm.idax.spark.examples.ReadExample
- Launch an application written in Python based on the application code in the ReadExample.py file that also requires the contents of the example_utilities.egg file (both files are on the host):
./spark-submit.sh ReadExample.py --py-files example_utilities.egg --loc host
- Launch an application written in R based on the application code in the file ReadExample.R:
./spark-submit.sh ReadExample.R
- Cancel the job with the submission ID 20160815210608126000:
./spark-submit.sh --kill 20160815210608126000
- List the files contained in the $HOME/spark/apps directory:
./spark-submit.sh --list-files apps