spark-submit.sh script
You can use the provided spark-submit.sh script to launch and manage your Apache Spark applications from a client machine. This script recognizes a subset of the configuration properties used by the spark-submit script provided by Apache Spark. It also introduces several additional commands and environment variables that are specific to the management of Spark applications within Db2® Warehouse.
Restrictions
The spark-submit.sh script for Db2 Warehouse:
- Can be used only on Linux® and macOS operating systems
- Can be used only to submit Scala, Java™, R, and Python applications
- Does not support SSL certificate verification (certificates are ignored and no peer verification takes place)
Prerequisite
To use the spark-submit.sh script, you must first download and install the cURL command-line tool. For more information, see Required REST tooling.
Before you begin
Before you begin:
- Download the spark-submit.sh script from the web console.
- Enter one or more of the following export commands to set environment variables that simplify the use of spark-submit.sh:
export DASHDBURL="https://hostname:8443"   # The URL of your Db2 Warehouse web console.
export DASHDBUSER=user-name                # Your user name.
export DASHDBPASS=password                 # Your password.
export DASHDBJSONOUT=YES|NO                # Whether output is to be returned in JSON (YES) or readable (NO) format.
Setting these environment variables eliminates the need for you to specify the same information each time you use spark-submit.sh.
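For illustration, a typical setup on a client machine might look like the following; the host name, user name, and password shown are placeholders, not real values:
export DASHDBURL="https://myhost.example.com:8443"
export DASHDBUSER=user42
export DASHDBPASS=mypassword
./spark-submit.sh --cluster-status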
Syntax
Description
- file_name
- The name of the file that contains the application code that is to be submitted.
- arguments
- Specify any arguments that are required as input for the application that is being run.
- --class
- For application code that is written in Java or Scala, this option specifies the name of the main class.
- --jars
- For application code that is written in Java or Scala, this option specifies a comma-separated list of any .jar files that are used by the application. These files must be in the $HOME/spark/apps directory.
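- For example, to submit a Scala application whose code also uses a library file named helper.jar (the file names and class name here are placeholders; both .jar files must already be in $HOME/spark/apps):
./spark-submit.sh subdir6/cool.jar --class c.myclass --jars helper.jar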
- --py-files
- For application code that is written in Python, this option specifies a comma-separated list of any .py, .zip, or .egg files that are used by the application. These files must be in the $HOME/spark/apps directory.
- --name
- This option specifies the name that is to be assigned to the application that is being launched. If this option is not specified, the name is set to the class name (for a Java or Scala application) or file name (for a Python or R application).
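- For example, to assign the name MyReadJob (a placeholder) to a Scala application, following the same invocation pattern as the other examples in this topic:
./spark-submit.sh subdir6/cool.jar --class c.myclass --name MyReadJob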
- --loc
- This option specifies the location of the file that contains the application code that is to be submitted:
- host
- The file is located on the Db2 Warehouse host system. Any path information specified in the file name indicates the path, relative to the $HOME/spark/apps directory, to the file. This is the default.
- client
- The file is located on your client system. Any path information specified in the file name indicates the path, relative to the current directory, to the file. The file is automatically deployed to the $HOME/spark/apps/temp directory before being submitted, and any file with the same name that is already in that directory is overwritten.
- For example, if the file is $HOME/spark/apps/subdir6/cool.jar on the host:
./spark-submit.sh --class c.myclass subdir6/cool.jar --loc host
- If the file is ./subdir6/cool.jar on the client:
./spark-submit.sh --class c.myclass subdir6/cool.jar --loc client
- --master
- If the Db2 Warehouse URL to which the application is to be submitted is different from the Db2 Warehouse URL that is currently set, use this option to specify the new URL. If the application is to run in a local Spark context that has a single worker thread, specify local.
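- For example, to run the ReadExample.py sample in a local Spark context with a single worker thread (this invocation is illustrative):
./spark-submit.sh ReadExample.py --master local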
- --load-samples
- This option loads the sample Spark application code contained in the following files into your $HOME/spark/apps directory:
idax_examples.jar
ReadExample.py
ReadExampleJson.py
ReadWriteExampleKMeans.py
ReadWriteExampleKMeansJson.py
ExceptionExample.py
SqlPredicateExample.py
example_utilities.egg
ReadExample.R
ReadExampleJson.R
ReadWriteExampleKMeans.R
ReadWriteExampleKMeansJson.R
ExceptionExample.R
SqlPredicateExample.R
example_utilities.R
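- For example, to load the samples and then confirm that they were copied (this sequence assumes that --load-samples takes no further arguments):
./spark-submit.sh --load-samples
./spark-submit.sh --list-files apps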
- --upload-file
- This option uploads the specified file from the specified source directory on your client system
to the directory that corresponds to the specified option:
- apps
- Upload to the $HOME/spark/apps directory. Use this target for files that are to be available to a particular application. This is the default option.
- defaultlibs
- Upload to the $HOME/spark/defaultlibs directory. Use this target for files that are to be available to all of your applications.
- globallibs
- Upload to the /globallibs directory. This option requires administrator access authority. Use this target for files that are to be available to all applications for all users.
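- For example, to upload a file named my_utilities.egg (a placeholder) from the current directory on your client system to your $HOME/spark/apps directory; the argument order shown here, with the target keyword followed by the file name, is an assumption:
./spark-submit.sh --upload-file apps my_utilities.egg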
- --user
- An administrator can issue some commands on behalf of another user, for example to upload files to or download or delete files from that user's $HOME/spark/apps or $HOME/spark/defaultlibs directories. This option specifies the name of the user on whose behalf a command is being issued.
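- For example, an administrator might list the files in the $HOME/spark/apps directory of the user user42 (a placeholder); combining --user with another command in this way is illustrative:
./spark-submit.sh --list-files apps --user user42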
- --download-file
- This option downloads the specified file to the current directory on your client system from the
directory that corresponds to the specified option:
- apps
- Download from the $HOME/spark/apps directory. Use this target for files that contain application code that you want to submit to Spark. This is the default option.
- defaultlibs
- Download from the $HOME/spark/defaultlibs directory. Use this target for files that are to be available to all of your applications.
- globallibs
- Download from the /globallibs directory. This option requires administrator access authority. Use this target for files that are to be available to all applications for all users.
- --dir
- The name of an existing directory on your client system into which a file is to be downloaded,
for example:
- --dir target1
- The target directory is a subdirectory of the current directory.
- --dir /target1
- The target directory is a subdirectory of the root directory.
- --dir ../target1
- The target directory is at the same level as the current directory.
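- For example, to download the ReadExample.py file from your $HOME/spark/apps directory into the target1 subdirectory of the current directory; the argument order, with the target keyword followed by the file name, is an assumption:
./spark-submit.sh --download-file apps ReadExample.py --dir target1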
- --list-files
- This option lists the files in the directory that corresponds to the specified option:
- apps
- List files in the $HOME/spark/apps directory. This is the default option.
- defaultlibs
- List files in the $HOME/spark/defaultlibs directory.
- globallibs
- List files in the /globallibs directory. This option requires administrator access authority.
- --delete-file
- This option deletes the specified file from the directory that corresponds to the specified option:
- apps
- Delete from the $HOME/spark/apps directory. This is the default option.
- defaultlibs
- Delete from the $HOME/spark/defaultlibs directory.
- globallibs
- Delete from the /globallibs directory. This option requires administrator access authority.
- --cluster-status
- This option retrieves information about the status of your Spark cluster, such as the number of applications that are currently running in it.
- --app-status
- This option retrieves information about the status of the application with the specified submission ID.
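- For example, to check the status of the application with submission ID 20160815210608126000 (the ID is the one used in the examples later in this topic), following the same pattern as the --kill command:
./spark-submit.sh --app-status 20160815210608126000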
- --list-apps
- This option retrieves information about all applications that are currently running or that ran since the cluster was last started.
- --download-cluster-logs
- This option retrieves the standard error and standard output logs of the master and worker processes of your Spark cluster.
- --download-app-logs
- This option retrieves all the log files for the application with the specified submission ID. If no submission ID is specified, the log files of the most recently run application are retrieved.
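- For example, to retrieve the log files of the application with submission ID 20160815210608126000 (illustrative; the ID is the one used in the examples later in this topic):
./spark-submit.sh --download-app-logs 20160815210608126000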
- --kill
- This option cancels the running application with the specified submission ID.
- --jsonout
- This option specifies that the spark-submit.sh script is to display its output in JSON format. This option overrides the setting of the DASHDBJSONOUT environment variable.
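- For example, to display the cluster status in JSON format regardless of the DASHDBJSONOUT setting:
./spark-submit.sh --cluster-status --jsonout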
- --display-cluster-log
- This option displays the contents of the cluster log file of the indicated type:
- out
- Standard output log file.
- err
- Standard error log file.
- master
- Log file of the master node of the cluster.
- worker
- Log file of the worker node with the specified IP address.
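- For example, to display the standard error log of the cluster (the placement of the log type after the option is an assumption):
./spark-submit.sh --display-cluster-log err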
- --display-app-log
- This option displays, for the application with the specified submission ID, the contents of the
log file of the indicated type:
- app
- Application log file.
- out
- Standard output log file.
- err
- Standard error log file.
- info
- Information log file containing a return code, messages, and an exception log from the application in JSON format.
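- For example, to display the information log of the application with submission ID 20160815210608126000 (the order of the log type and submission ID arguments is an assumption):
./spark-submit.sh --display-app-log info 20160815210608126000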
- --webui-url
- This option retrieves the URL of the Spark Web User Interface (UI), which you can use to monitor your cluster. The URL has the form:
http://dashDB-hostname:port
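- For example (assuming the option takes no further arguments):
./spark-submit.sh --webui-url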
- --env
- This option retrieves the settings of the following environment variables:
- DASHDBURL
- Sets the URL of the Db2 Warehouse web console.
- DASHDBUSER
- Sets the user name used to log in to the Db2 Warehouse web console.
- DASHDBPASS
- Sets the password used to log in to the Db2 Warehouse web console.
- DASHDBJSONOUT
- Specifies whether output is to be returned in JSON (YES) or readable (NO) format.
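- For example, to display the current settings (assuming the option takes no further arguments):
./spark-submit.sh --env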
- --version
- This option displays the version of the spark-submit.sh script and the build level of the Db2 Warehouse system that you are using.
- --help
- This option displays a description of the syntax of the spark-submit.sh script.
Output formats
The DASHDBJSONOUT environment variable specifies the default format for output
that is returned by the spark-submit.sh script:
- DASHDBJSONOUT=YES
- Output is returned in JSON format. This format is suitable for processing by other programs. For example, the output from a --cluster-status command looks similar to this:
{"statusDesc":"Cluster is running.","resultCode":200,"clusters":[{"running_jobs":1,"monitoring_url":"http:\/\/9.152.63.165:25005","username":"user42"}],"username":"user42","status":"running"}
- DASHDBJSONOUT=NO
- Output is returned in readable format. This is the default. For example, the output from a --cluster-status command looks similar to this:
status: Running
statusDesc: Cluster is running.
running jobs: 1
Examples
- Launch an application written in Scala based on the application code in the idax_examples.jar file, which is located in the jars subdirectory of the current directory on the client system:
./spark-submit.sh jars/idax_examples.jar --loc client --class com.ibm.idax.spark.examples.ReadExample
- Launch an application written in Python based on the application code in the ReadExample.py file that also requires the contents of the example_utilities.egg file (both files are on the host):
./spark-submit.sh ReadExample.py --py-files example_utilities.egg --loc host
- Launch an application written in R based on the application code in the file ReadExample.R:
./spark-submit.sh ReadExample.R
- Cancel the job with the submission ID 20160815210608126000:
./spark-submit.sh --kill 20160815210608126000
- List the files contained in the $HOME/spark/apps directory:
./spark-submit.sh --list-files apps