spark-submit.sh script

You can use the provided spark-submit.sh script to launch and manage your Apache Spark applications from a client machine. This script recognizes a subset of the configuration properties used by the spark-submit script provided by Apache Spark. It also introduces several additional commands and environment variables that are specific to the management of Spark applications within Db2® Warehouse.

Restrictions

The spark-submit.sh script for Db2 Warehouse:

  • Can be used only on Linux® and macOS operating systems
  • Can be used only to submit Scala, Java™, R, and Python applications
  • Does not support SSL certificate verification (certificates are ignored and no peer verification takes place)

Prerequisite

To use the spark-submit.sh script, you must first download and install the cURL command-line tool. For more information, see Required REST tooling.

Before you begin

  • Download the spark-submit.sh script from the console. To do this, click ANALYTICS > Spark Analytics. Then, from the options on the right side of the window, click Download spark-submit.sh.
  • Enter one or more of the following export commands to set environment variables that simplify the use of spark-submit.sh:
    export DASHDBURL="https://hostname:8443" # The URL of your Db2 Warehouse web console.
    export DASHDBUSER=user-name             # Your user name.
    export DASHDBPASS=password              # Your password.
    export DASHDBJSONOUT=YES|NO             # Whether output is to be returned in JSON (YES) or readable (NO) format.
    Setting these environment variables eliminates the need for you to specify the same information each time you use spark-submit.sh.
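    For example, with placeholder values filled in (the host name, user name, and password shown here are illustrative only), you could set the variables once and then check the cluster status without repeating that information on the command line:
    export DASHDBURL="https://myhost.example.com:8443"  # placeholder host name
    export DASHDBUSER=user42                            # placeholder user name
    export DASHDBPASS=mypassword                        # placeholder password
    ./spark-submit.sh --cluster-status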

Syntax

The spark-submit.sh command has the following forms:
  spark-submit.sh file_name [arguments] [application options]
  spark-submit.sh --load-samples
  spark-submit.sh --upload-file [apps|defaultlibs|globallibs] source_path [--user user_name]
  spark-submit.sh --download-file [apps|defaultlibs|globallibs] file_name [--user user_name] [--dir target_dir]
  spark-submit.sh --list-files [apps|defaultlibs|globallibs] [--user user_name]
  spark-submit.sh --delete-file [apps|defaultlibs|globallibs] path [--user user_name]
  spark-submit.sh --cluster-status
  spark-submit.sh --app-status submission_ID
  spark-submit.sh --list-apps
  spark-submit.sh --download-cluster-logs [--dir target_dir]
  spark-submit.sh --download-app-logs [submission_ID] [--dir target_dir]
  spark-submit.sh --kill submission_ID
  spark-submit.sh --display-cluster-log {out|err|master|worker IP_address}
  spark-submit.sh --display-app-log {app|out|err|info} [submission_ID]
  spark-submit.sh --webui-url
  spark-submit.sh --env
  spark-submit.sh --version
  spark-submit.sh --help
The --jsonout option can be added to any of these forms to request JSON output for that command.

application options
  [--class main_class] [--jars file_name,...] [--py-files file_name,...] [--name name] [--loc {host|client}] [--master {https://dashDB_host:8443 | local}]

Description

file_name
The name of the file that contains the application code that is to be submitted.
arguments
Specify any arguments that are required as input for the application that is being run.
--class
For application code that is written in Java or Scala, this option specifies the name of the main class.
--jars
For application code that is written in Java or Scala, this option specifies a comma-separated list of any .jar files that are used by the application. These files must be in the $HOME/spark/apps directory.
--py-files
For application code that is written in Python, this option specifies a comma-separated list of any .py, .zip, or .egg files that are used by the application. These files must be in the $HOME/spark/apps directory.
--name
This option specifies the name that is to be assigned to the application that is being launched. If this option is not specified, the name is set to the class name (for a Java or Scala application) or file name (for a Python or R application).
--loc
This option specifies the location of the file that contains the application code that is to be submitted:
host
The file is located on the Db2 Warehouse host system. Any path information specified in the file name indicates the path, relative to the $HOME/spark/apps directory, to the file. This is the default.
client
The file is located on your client system. Any path information specified in the file name indicates the path, relative to the current directory, to the file. The file is automatically deployed to the $HOME/spark/apps/temp directory before being submitted, and any file with the same name that is already in that directory is overwritten.
For example, the following two commands specify identical file paths (subdir6/cool.jar) but different file locations:
  • The file is $HOME/spark/apps/subdir6/cool.jar, on the host:
    ./spark-submit.sh --class c.myclass subdir6/cool.jar --loc host
  • The file is ./subdir6/cool.jar, on the client:
    ./spark-submit.sh --class c.myclass subdir6/cool.jar --loc client
--master
If the Db2 Warehouse URL to which the application is to be submitted is different from the Db2 Warehouse URL that is currently set, use this option to specify the new URL. If the application is to run in a local Spark context that has a single worker thread, specify local.
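For example, the following command should run an application in a local Spark context that has a single worker thread (my_app.py is a placeholder file name):
    ./spark-submit.sh my_app.py --master local   # my_app.py is a placeholder file name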
--load-samples
This option loads the sample Spark application code contained in the following files into your $HOME/spark/apps directory:
idax_examples.jar
ReadExample.py
ReadExampleJson.py
ReadWriteExampleKMeans.py
ReadWriteExampleKMeansJson.py
ExceptionExample.py
SqlPredicateExample.py
example_utilities.egg
ReadExample.R
ReadExampleJson.R
ReadWriteExampleKMeans.R
ReadWriteExampleKMeansJson.R
ExceptionExample.R
SqlPredicateExample.R
example_utilities.R
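For example, a typical sequence is to load the samples into your $HOME/spark/apps directory and then submit one of them by name (this mirrors the Python example later in this topic):
    ./spark-submit.sh --load-samples
    ./spark-submit.sh ReadExample.py --py-files example_utilities.egg --loc host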
--upload-file
This option uploads the specified file from the specified source directory on your client system to the directory that corresponds to the specified option:
apps
Upload to the $HOME/spark/apps directory. Use this target for files that are to be available to a particular application. This is the default option.
defaultlibs
Upload to the $HOME/spark/defaultlibs directory. Use this target for files that are to be available to all of your applications.
globallibs
Upload to the /globallibs directory. This option requires administrator access authority. Use this target for files that are to be available to all applications for all users.
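For example, the following command should upload the file my_app.py (a placeholder file name) from the current directory on your client system to your $HOME/spark/apps directory:
    ./spark-submit.sh --upload-file apps my_app.py   # my_app.py is a placeholder file name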
--user
An administrator can issue some commands on behalf of another user, for example, to upload files to, or to download or delete files from, that user's $HOME/spark/apps or $HOME/spark/defaultlibs directory. This option specifies the name of the user on whose behalf the command is being issued.
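For example, an administrator might list the files in another user's $HOME/spark/apps directory (user55 is a placeholder user name):
    ./spark-submit.sh --list-files apps --user user55   # user55 is a placeholder user name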
--download-file
This option downloads the specified file to the current directory on your client system from the directory that corresponds to the specified option:
apps
Download from the $HOME/spark/apps directory. Use this target for files that contain application code that you want to submit to Spark. This is the default option.
defaultlibs
Download from the $HOME/spark/defaultlibs directory. Use this target for files that are to be available to all of your applications.
globallibs
Download from the /globallibs directory. This option requires administrator access authority. Use this target for files that are to be available to all applications for all users.
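For example, the following command should download the sample file ReadExample.py from your $HOME/spark/apps directory to the current directory on your client system:
    ./spark-submit.sh --download-file apps ReadExample.py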
--dir
The name of an existing directory on your client system into which a file is to be downloaded, for example:
--dir target1
The target directory is a subdirectory of the current directory.
--dir /target1
The target directory is a subdirectory of the root directory.
--dir ../target1
The target directory is at the same level as the current directory.
--list-files
This option lists the files in the directory that corresponds to the specified option:
apps
List files in the $HOME/spark/apps directory. This is the default option.
defaultlibs
List files in the $HOME/spark/defaultlibs directory.
globallibs
List files in the /globallibs directory. This option requires administrator access authority.
--delete-file
This option deletes the specified file from the directory that corresponds to the specified option:
apps
Delete from the $HOME/spark/apps directory. This is the default option.
defaultlibs
Delete from the $HOME/spark/defaultlibs directory.
globallibs
Delete from the /globallibs directory. This option requires administrator access authority.
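For example, the following command should delete the file subdir6/cool.jar (the illustrative path used earlier in this topic) from your $HOME/spark/apps directory:
    ./spark-submit.sh --delete-file apps subdir6/cool.jar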
--cluster-status
This option retrieves information about the status of your Spark cluster, such as the number of applications that are currently running in it.
--app-status
This option retrieves information about the status of the application with the specified submission ID.
--list-apps
This option retrieves information about all applications that are currently running or that ran since the cluster was last started.
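For example, the following commands should list all applications and then query one of them by its submission ID (20160815210608126000 is the sample submission ID used elsewhere in this topic):
    ./spark-submit.sh --list-apps
    ./spark-submit.sh --app-status 20160815210608126000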
--download-cluster-logs
This option retrieves the standard error and standard output logs of the master and worker processes of your Spark cluster.
--download-app-logs
This option retrieves all the log files for the application with the specified submission ID. If no submission ID is specified, the log files of the most recently submitted application are retrieved.
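For example, the following command should download the log files for that sample submission ID into the existing logs subdirectory of the current directory:
    ./spark-submit.sh --download-app-logs 20160815210608126000 --dir logs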
--kill
This option cancels the running application with the specified submission ID.
--jsonout
This option specifies that the spark-submit.sh script is to display its output in JSON format. This option overrides the setting of the DASHDBJSONOUT environment variable.
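For example, the following command should return the cluster status in JSON format even if DASHDBJSONOUT=NO is set (the placement of --jsonout after the command is illustrative):
    ./spark-submit.sh --cluster-status --jsonout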
--display-cluster-log
This option displays the contents of the cluster log file of the indicated type:
out
Standard output log file.
err
Standard error log file.
master
Log file of the master node of the cluster.
worker
Log file of the worker node with the specified IP address.
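For example, the following command should display the standard error log of your Spark cluster:
    ./spark-submit.sh --display-cluster-log err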
--display-app-log
This option displays, for the application with the specified submission ID, the contents of the log file of the indicated type:
app
Application log file.
out
Standard output log file.
err
Standard error log file.
info
Information log file containing a return code, messages, and an exception log from the application in JSON format.
If no submission ID is specified, the log file of the most recently submitted application is displayed.
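For example, the following command should display the information log for the sample submission ID used elsewhere in this topic:
    ./spark-submit.sh --display-app-log info 20160815210608126000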
--webui-url
This option retrieves the URL of the Spark web user interface (UI), which you can use to monitor your Spark cluster. The URL has the following form:
http://dashDB-hostname:port
--env
This option retrieves the settings of the following environment variables:
DASHDBURL
Sets the URL of the Db2 Warehouse web console.
DASHDBUSER
Sets the user name used to log in to the Db2 Warehouse web console.
DASHDBPASS
Sets the password used to log in to the Db2 Warehouse web console.
DASHDBJSONOUT
Specifies whether output is to be returned in JSON (YES) or readable (NO) format.
--version
This option displays the version of the spark-submit.sh script and the build level of the Db2 Warehouse installation that you are using.
--help
This option displays a description of the syntax of the spark-submit.sh script.

Output formats

The DASHDBJSONOUT environment variable specifies the default format for output that is returned by the spark-submit.sh script:
DASHDBJSONOUT=YES
Output is returned in JSON format. This format is suitable for processing by other programs. For example, the output from a --cluster-status command will look similar to this:
{"statusDesc":"Cluster is running.","resultCode":200,"clusters":[{"running_jobs":1,
"monitoring_url":"http:\/\/9.152.63.165:25005","username":"user42"}],"username":
"user42","status":"running"}
DASHDBJSONOUT=NO
Output is returned in readable format. This is the default. For example, the output from a --cluster-status command will look similar to this:
status: Running
statusDesc: Cluster is running.
running jobs: 1
You can override this setting for a single command by specifying the --jsonout option for that command.

Examples

  • Launch an application written in Scala based on the application code in the idax_examples.jar file, which is located in the jars subdirectory of the current directory on the client system:
    ./spark-submit.sh jars/idax_examples.jar --loc client --class com.ibm.idax.spark.examples.ReadExample
  • Launch an application written in Python based on the application code in the ReadExample.py file that also requires the contents of the example_utilities.egg file (both files are on the host):
    ./spark-submit.sh ReadExample.py --py-files example_utilities.egg --loc host
  • Launch an application written in R based on the application code in the file ReadExample.R:
    ./spark-submit.sh ReadExample.R
  • Cancel the job with the submission ID 20160815210608126000:
    ./spark-submit.sh --kill 20160815210608126000
  • List the files contained in the $HOME/spark/apps directory:
    ./spark-submit.sh --list-files apps