Spark jobs API syntax, parameters and return codes

You typically submit a Spark job with a cURL command.

The Spark job cURL command syntax is:

curl -k -X POST <V3_JOBS_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>" -d @input.json

Spark jobs cURL options:

The -k option skips certificate validation because the service instance website uses a self-signed SSL certificate.
<V3_JOBS_API_ENDPOINT> is the endpoint for the instance that you want to use to submit your Spark job. Note that multiple Analytics Engine Powered by Apache Spark instances can exist on the IBM Cloud Pak for Data server and each instance has its own endpoint for submitting jobs. To get the Spark jobs endpoint for your provisioned instance, see Administering the service instance.
The -H option sets a request header, which is a key-value pair. You must send the bearer token (<ACCESS_TOKEN>) in an Authorization header. To get the access token for your service instance, see Administering the service instance.

The -d option defines the payload data to be sent in a POST request to the server. See the example of an input payload below.

Note: If the application creates a SparkContext, the POST method returns when the SparkSession is created. If the application does not create a SparkContext, the POST method is blocked until the application terminates.
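
The cURL command above can also be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not an official client: the function names are hypothetical, and the `Content-Type: application/json` header is an assumption because the cURL command does not set a content type explicitly. The unverified SSL context mirrors the -k option.

```python
import json
import ssl
import urllib.request


def build_submit_request(endpoint, access_token, payload):
    """Build the POST request that the cURL command above would send."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            # Assumption: the cURL command does not set this header explicitly.
            "Content-Type": "application/json",
        },
    )


def insecure_context():
    """Return an SSL context that skips certificate validation, like curl -k."""
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before verify_mode
    context.verify_mode = ssl.CERT_NONE
    return context


def submit_job(endpoint, access_token, payload):
    """Send the request and return (HTTP status, parsed JSON body)."""
    request = build_submit_request(endpoint, access_token, payload)
    with urllib.request.urlopen(request, context=insecure_context()) as response:
        return response.status, json.loads(response.read().decode("utf-8"))
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so a caller of `submit_job` would need to catch that exception to inspect error codes.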

An example of an input payload for a Python job:

  {
      "template_id": "<template_id>",
      "application_details": {
              "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
              "application_arguments": ["/opt/ibm/spark/examples/src/main/resources/people.txt"],
              "conf": {
                      "spark.app.name": "MyJob",
                      "spark.eventLog.enabled": "true"
                      },
              "env": {
                      "SAMPLE_ENV_KEY": "SAMPLE_VALUE"
                      },
              "driver-memory": "4G",
              "driver-cores": 1,
              "executor-memory": "4G",
              "executor-cores": 1,
              "num-executors": 1
      }
  }
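
One way to produce the input.json file passed to the -d option is to build the payload as a Python dictionary and serialize it. The following is a minimal sketch: the helper name `write_payload` is hypothetical, and the values are copied from the Python job example above, including the `<template_id>` placeholder.

```python
import json


def write_payload(path, payload):
    """Serialize a job payload to a JSON file such as input.json."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)


python_job = {
    "template_id": "<template_id>",  # placeholder, as in the example above
    "application_details": {
        "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
        "application_arguments": [
            "/opt/ibm/spark/examples/src/main/resources/people.txt"
        ],
        "conf": {"spark.app.name": "MyJob", "spark.eventLog.enabled": "true"},
        "env": {"SAMPLE_ENV_KEY": "SAMPLE_VALUE"},
        "driver-memory": "4G",
        "driver-cores": 1,
        "executor-memory": "4G",
        "executor-cores": 1,
        "num-executors": 1,
    },
}
```

Calling `write_payload("input.json", python_job)` would create the file that the cURL command reads with `-d @input.json`.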

An example of an input payload for a Scala job:

  {
      "template_id": "<template_id>",
      "application_details": {
              "application": "/opt/ibm/spark/examples/jars/spark-examples*.jar",
              "application_arguments": ["1"],
              "class": "org.apache.spark.examples.SparkPi",
              "conf": {
                      "spark.app.name": "MyJob",
                      "spark.eventLog.enabled": "true"
                      },
              "env": {
                      "SAMPLE_ENV_KEY": "SAMPLE_VALUE"
                      },
              "driver-memory": "4G",
              "driver-cores": 1,
              "executor-memory": "4G",
              "executor-cores": 1,
              "num-executors": 1
      }
  }

The POST method returns when the application creates a SparkSession. The application continues to run and the status can be retrieved.

The returned response if your job was successfully submitted:

  {
      "application_id": "<application_id>",
      "state": "RUNNING",
      "start_time": "Monday' 07 June 2021 '14:46:23.237+0000",
      "spark_application_id": "app-20210607144623-0000"
  }

Hint:

Save the returned value of "application_id" to get the status of the job or to stop the job.
Save the returned value of "spark_application_id" to monitor and analyze the Spark application on the Spark history server.
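
As an illustration of the hint above, the two identifiers can be extracted from the response body as follows. This is a sketch only: the function name `extract_job_ids` is hypothetical, and the response text is the example shown earlier in this section.

```python
import json

# Example response body copied from the section above.
response_text = """
{
    "application_id": "<application_id>",
    "state": "RUNNING",
    "start_time": "Monday' 07 June 2021 '14:46:23.237+0000",
    "spark_application_id": "app-20210607144623-0000"
}
"""


def extract_job_ids(body):
    """Return the two identifiers worth saving from a submit response."""
    data = json.loads(body)
    return {
        "application_id": data["application_id"],
        "spark_application_id": data["spark_application_id"],
    }


ids = extract_job_ids(response_text)
print(ids["application_id"])        # needed to get the job status or stop the job
print(ids["spark_application_id"])  # needed on the Spark history server
```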

Spark jobs API parameters

These are the parameters you can use in the Spark jobs API:

Table 1. Parameters for the Spark jobs API
Name	Sub-properties	Required/Optional	Type	Description
application_details		Required	Object	Specifies the Spark application details
	application	Required	String	Specifies the Spark application file, that is, the file path to the Python, R, or Scala job file
	application_arguments	Optional	String[]	Specifies the application arguments
	conf	Optional	Key-value JSON object	Specifies the Spark configuration values that override the predefined values. See the section Default Spark configuration parameters and environment variables for the default configuration parameters defined by the Spark service. See Apache Spark configurations for the configuration parameters supported by Apache Spark.
	env	Optional	Key-value JSON object	Specifies Spark environment variables required for the job. See the section Default Spark configuration parameters and environment variables for the default environment variables defined by the Spark service. See Apache Spark environment variables for the environment variables supported by Apache Spark.
	class	Optional	String	Specifies the entry point for your Scala application.
	executor-memory	Optional	String	Specifies the memory per executor, for example 1000M or 2G. The default is 1G.
	executor-cores	Optional	Integer	Specifies the number of cores per executor or all available cores on the worker in standalone mode. The default is 1. The maximum is 5 CPU.
	num-executors	Optional	Integer	Specifies the number of executors to launch. The default is 1.
	driver-cores	Optional	Integer	Specifies the number of cores used by the driver, only in cluster mode. The default is 1. The maximum is 5 CPU.
	driver-memory	Optional	String	Specifies the memory for the driver, for example 1000M or 2G. The default is 1024M.
	driver-java-options	Optional	String	Specifies extra Java options to pass to the driver
	driver-library-path	Optional	String	Specifies extra library path entries to pass to the driver
	driver-class-path	Optional	String	Specifies extra class path entries to pass to the driver. Note that jars added with `--jars` are automatically included in the classpath.
	jars	Optional	String	Specifies a comma-separated list of jars to include on the driver and executor classpaths
	packages	Optional	String	Specifies a comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. Searches the local Maven repository, then Maven central and finally any additional remote repositories given by `--repositories`. The format for the coordinates should be `groupId:artifactId:version`.
	exclude-packages	Optional	String	Specifies a comma-separated list of `groupId:artifactId` to exclude while resolving the dependencies provided in `--packages` to avoid dependency conflicts
	repositories	Optional	String	Specifies a comma-separated list of additional remote repositories to search for the Maven coordinates given with `--packages`
	py-files	Optional	String	Specifies a comma-separated list of `.zip`, `.egg`, or `.py` files to place on the PYTHONPATH for Python apps
template_id		Optional	String	Specifies the Spark version and preinstalled system libraries. The default is `spark-3.0.0-jaas-v2-cp4d-template` for Spark 3.0. The Spark 2.4 template `spark-2.4.0-jaas-v2-cp4d-template` can only be used if you are on a Cloud Pak for Data version prior to 4.0.7.
volumes		Optional	list of objects	Specifies the volumes to be mounted other than the Spark instance volume. If volumes are added in the application payload, then the conf section in the payload is mandatory.
	name	Required	String	Specifies the name of the volume
	source_sub_path	Optional	String	Specifies the source path in the volume to be mounted. Source path MUST be a relative path.
	mount_path	Required	String	Specifies the location where the volume is to be mounted. A few mount paths are prohibited because they can compromise the runtime, and you are prevented from using them.
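
Some of the constraints in Table 1 can be checked on the client before a job is submitted. The following is a minimal sketch, not the validation that the service performs. The function name `validate_payload` is hypothetical, and three points are assumptions: that `volumes` sits at the top level of the payload (inferred from the table layout), that memory values always take a form like 1000M or 2G, and that the minimum number of cores is 1.

```python
import re
from pathlib import PurePosixPath

MAX_CORES = 5  # maximum for executor-cores and driver-cores per Table 1
MEMORY_PATTERN = re.compile(r"^\d+[MG]$")  # assumption: forms like 1000M or 2G


def validate_payload(payload):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    details = payload.get("application_details")
    if not isinstance(details, dict):
        return ["application_details is required and must be an object"]
    if not isinstance(details.get("application"), str):
        problems.append("application_details.application is required")
    for key in ("executor-cores", "driver-cores"):
        cores = details.get(key, 1)  # the documented default is 1
        if not isinstance(cores, int) or not 1 <= cores <= MAX_CORES:
            problems.append(f"{key} must be an integer from 1 to {MAX_CORES}")
    for key in ("executor-memory", "driver-memory"):
        memory = details.get(key)
        if memory is not None and not MEMORY_PATTERN.match(str(memory)):
            problems.append(f"{key} should look like 1000M or 2G")
    volumes = payload.get("volumes", [])  # assumption: top-level property
    if volumes and "conf" not in details:
        problems.append("conf is mandatory when volumes are specified")
    for volume in volumes:
        for key in ("name", "mount_path"):
            if key not in volume:
                problems.append(f"volumes entry is missing {key}")
        sub_path = volume.get("source_sub_path")
        if sub_path is not None and PurePosixPath(sub_path).is_absolute():
            problems.append("source_sub_path must be a relative path")
    return problems
```

The check does not cover prohibited mount paths, because this section does not list them.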

Response codes

The Spark jobs API returns the following response codes:

Spark job API response codes
Return code	Meaning of the return code	Description
201	Job created	The Spark job was successfully submitted. Job response: `{"application_id":"<job_id>", "state":"<job_state>", "start_time": "<start_time>", "spark_application_id": "<spark_app_id>"}`
400	Bad request	This is returned when the payload is incorrect, for example, if the payload format is incorrect or arguments are missing.
404	Not Found	This is returned when the Spark application is submitted for an instance ID that does not exist.
500	Internal server error	This is returned when the server isn’t responding to what you’re asking it to do. Try submitting your job again.
503	Service unavailable	This is returned when there are insufficient resources. Possible response: `Could not complete the request. Reason - FailedScheduling. Detailed error - 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.`
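
A client might map these response codes to a follow-up action. The sketch below is illustrative only: the function name `classify_response` is hypothetical, and treating 503 as retryable is an assumption (the table suggests resubmitting only for 500, while a 503 may succeed once resources become available).

```python
# Response codes and meanings copied from the table above.
RESPONSE_CODES = {
    201: "Job created",
    400: "Bad request",
    404: "Not Found",
    500: "Internal server error",
    503: "Service unavailable",
}

# Assumption: these codes may succeed on a later attempt.
RETRYABLE = {500, 503}


def classify_response(status_code):
    """Return (meaning, should_retry) for a status code from the jobs API."""
    meaning = RESPONSE_CODES.get(status_code, "Undocumented response code")
    return meaning, status_code in RETRYABLE
```

For example, `classify_response(400)` reports a bad request that should not be retried until the payload is corrected.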

Default Spark configuration parameters and environment variables

The following tables show the Spark configuration parameters and environment variables that are commonly used in Analytics Engine Powered by Apache Spark and their default values.

The following table lists the Spark configuration parameters and their defaults:

Default Spark configuration parameters
Spark configuration	Default value
`spark.eventLog.enabled`	TRUE
`spark.driver.extraClassPath`	`/home/spark/space/assets/data_asset/:/home/spark/user_home/dbdrivers/:/cc-home/_global_/dbdrivers/:/home/spark/shared/user-libs/spark2/:/home/spark/user_home/dbdrivers/:/home/spark/shared/user-libs/common/:/home/spark/shared/user-libs/connectors/:/opt/ibm/connectors/parquet-encryption/:/opt/ibm/third-party/libs/spark2/:/opt/ibm/third-party/libs/common/:/opt/ibm/third-party/libs/connectors/:/opt/ibm/spark/external-jars/`
`spark.executor.extraClassPath`	`/home/spark/space/assets/data_asset/:/home/spark/user_home/dbdrivers/:/cc-home/_global_/dbdrivers/:/home/spark/shared/user-libs/spark2/:/home/spark/user_home/dbdrivers/:/home/spark/shared/user-libs/common/:/home/spark/shared/user-libs/connectors/:/opt/ibm/connectors/parquet-encryption/:/opt/ibm/third-party/libs/spark2/:/opt/ibm/third-party/libs/common/:/opt/ibm/third-party/libs/connectors/:/opt/ibm/spark/external-jars/`
`spark.master.ui.port`	8080
`spark.worker.ui.port`	8081
`spark.ui.port`	4040
`spark.history.ui.port`	18080
`spark.ui.enabled`	TRUE
`spark.ui.killEnabled`	FALSE
`spark.eventLog.dir`	`file:///home/spark/spark-events`
`spark.ui.reverseProxy`	TRUE
`spark.ui.showConsoleProgress`	TRUE
`spark.shuffle.service.port`	7337
`spark.r.command`	/opt/ibm/conda/R/bin/Rscript
`spark.hadoop.fs.s3a.fast.upload`	TRUE
`spark.hadoop.fs.s3a.multipart.size`	33554432
`spark.hadoop.fs.stocator.scheme.list`	cos
`spark.hadoop.fs.stocator.cos.scheme`	cos
`spark.hadoop.fs.stocator.glob.bracket.support`	TRUE
`spark.hadoop.fs.stocator.cos.impl`	`com.ibm.stocator.fs.cos.COSAPIClient`
`spark.hadoop.fs.cos.impl`	`com.ibm.stocator.fs.ObjectStoreFileSystem`
`spark.hadoop.fs.s3a.impl`	`org.apache.hadoop.fs.s3a.S3AFileSystem`
`spark.authenticate`	FALSE
`spark.network.crypto.enabled`	FALSE
`spark.network.crypto.keyLength`	256
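
Values in the `conf` section of the payload override these defaults. The following sketch illustrates that override behavior with a plain dictionary merge. It is not the service's implementation: the function name `effective_conf` is hypothetical, only a few defaults from the table are included, and the lowercase string form of the values is an assumption based on the payload examples.

```python
# A few defaults copied from the table above (not the full list).
DEFAULT_CONF = {
    "spark.eventLog.enabled": "true",
    "spark.ui.port": "4040",
    "spark.ui.killEnabled": "false",
}


def effective_conf(overrides):
    """Merge the payload conf over the defaults; payload values win."""
    merged = dict(DEFAULT_CONF)  # copy so the defaults stay unchanged
    merged.update(overrides)
    return merged


conf = effective_conf({"spark.app.name": "MyJob", "spark.ui.killEnabled": "true"})
```

Here `conf` keeps the default event log and UI port settings, while `spark.ui.killEnabled` takes the value from the payload.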

The following table lists the environment variables and their defaults:

Default Spark environment variables
Environment variable	Default value
SPARK_DIST_CLASSPATH	`/home/spark/space/assets/data_asset/:/home/spark/user_home/dbdrivers/:/cc-home/_global_/dbdrivers/:/opt/ibm/connectors/idax/:/opt/ibm/connectors/cloudant/:/opt/ibm/connectors/db2/:/opt/ibm/connectors/others-db-drivers/:/opt/ibm/connectors/wdp-connector-driver/:/opt/ibm/connectors/wdp-connector-jdbc-library/:/opt/ibm/connectors/stocator/:/opt/ibm/connectors/s3/:/opt/ibm/image-libs/common/:/opt/ibm/image-libs/spark2/:/opt/ibm/third-party/libs/batch/:/opt/ibm/spark/external-jars/*`
SPARK_LOCAL_DIRS	/tmp/spark/scratch
SPARK_MASTER_WEBUI_PORT	8080
SPARK_MASTER_PORT	7077
SPARK_WORKER_WEBUI_PORT	8081
CLASSPATH	`/home/spark/user_home/dbdrivers/:/opt/ibm/connectors/idax/:/opt/ibm/connectors/cloudant/:/opt/ibm/connectors/db2/:/opt/ibm/connectors/others-db-drivers/:/opt/ibm/connectors/wdp-connector-driver/:/opt/ibm/connectors/wdp-connector-jdbc-library/:/opt/ibm/connectors/stocator/:/opt/ibm/connectors/s3/:/opt/ibm/image-libs/common/:/opt/ibm/image-libs/spark2/:/opt/ibm/third-party/libs/batch/`
LD_LIBRARY_PATH	`/opt/ibm/connectors/dsdriver/dsdriver/lib:/opt/ibm/connectors/others-db-drivers/oracle/lib:/opt/ibm/jdk/jre/lib/architecture/server:/opt/ibm/jdk/jre/lib/architecture/:/usr/local/lib:/lib64`
RUNTIME_PYTHON_ENV	python37
PYTHONPATH	/home/spark/space/assets/data_asset:/home/spark/user_home/python-3:/cc-home/_global_/python-3:/home/spark/shared/user-libs/python:/home/spark/shared/conda/envs/python/lib/python/site-packages:/opt/ibm/conda/miniconda/lib/python/site-packages:/opt/ibm/third-party/libs/python3:/opt/ibm/image-libs/python3:/opt/ibm/image-libs/spark2/xskipper-core.jar:/opt/ibm/image-libs/spark2/spark-extensions.jar:/opt/ibm/image-libs/spark2/metaindexmanager.jar:/opt/ibm/image-libs/spark2/stmetaindexplugin.jar:/opt/ibm/spark/python:/opt/ibm/spark/python/lib/py4j-0.10.7-src.zip
R_LIBS_USER	`/home/spark/space/assets/data_asset:/home/spark/shared/user-libs/R:/opt/ibm/third-party/libs/R:/opt/ibm/conda/R/lib64/R/library/:/opt/ibm/spark/R/lib:/opt/ibm/image-libs/R`
