Important:

IBM Cloud Pak® for Data Version 4.7 will reach end of support (EOS) on 31 July, 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.

Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.7 reaches end of support. For more information, see Upgrading IBM Software Hub in the IBM Software Hub Version 5.1 documentation.

Spark jobs API syntax, parameters and return codes

You typically submit a Spark job with a cURL command.

The Spark job cURL command syntax is:

curl -k -X POST <V4_JOBS_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>" -d @input.json

Spark jobs cURL options:

  • The -k option skips certificate validation because the service instance uses a self-signed SSL certificate.
  • <V4_JOBS_API_ENDPOINT> is the endpoint for the instance that you want to use to submit your Spark job. Note that multiple Analytics Engine Powered by Apache Spark instances can exist on the IBM Cloud Pak for Data server and each instance has its own endpoint for submitting jobs. To get the Spark jobs endpoint for your provisioned instance, see Administering the service instance.
  • The -H option passes a header parameter, which is a key-value pair. You must send the bearer token (<ACCESS_TOKEN>) in an authorization header. To get the access token for your service instance, see Generating an access token.
  • The -d option defines the payload data to be sent in a POST request to the server. See the example of an input payload below.
Note: The POST method returns after the initial validation of the application. The job request is processed asynchronously; first the SparkContext is created and then the application is executed. The current status of the application can be fetched by using the GET method. See [Spark job status](spark-jobs.html#spark-job-status).
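
If you prefer to submit the job from a script rather than with cURL, the following minimal sketch shows the same POST request in Python. The environment variable names JOBS_API_ENDPOINT and ACCESS_TOKEN are placeholders used only for this illustration; input.json can contain any of the example payloads that follow.

# Minimal sketch: submit a Spark job from Python instead of cURL.
# JOBS_API_ENDPOINT and ACCESS_TOKEN are placeholder names; obtain the real
# values as described in "Administering the service instance" and
# "Generating an access token".
import json
import os

import requests

endpoint = os.environ["JOBS_API_ENDPOINT"]   # <V4_JOBS_API_ENDPOINT>
token = os.environ["ACCESS_TOKEN"]           # <ACCESS_TOKEN>

with open("input.json") as f:
    payload = json.load(f)

# verify=False mirrors the cURL -k option (self-signed SSL certificate).
response = requests.post(
    endpoint,
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
    verify=False,
)
print(response.status_code, response.json())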

An example of an input payload for a Python job:

{
  "application_details": {
    "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
    "arguments": ["/opt/ibm/spark/examples/src/main/resources/people.txt"],
    "conf": {
      "spark.app.name": "MyJob",
      "spark.eventLog.enabled": "true",
      "spark.driver.memory": "4G",
      "spark.driver.cores": 1,
      "spark.executor.memory": "4G",
      "spark.executor.cores": 1,
      "ae.spark.executor.count": 1
    },
    "env": {
      "SAMPLE_ENV_KEY": "SAMPLE_VALUE"
    }
  }
}

An example of an input payload for an R job:

{
  "application_details": {
    "application": "/opt/ibm/spark/examples/src/main/r/dataframe.R",
    "conf": {
      "spark.app.name": "MyJob",
      "spark.eventLog.enabled": "true",
      "spark.driver.memory": "4G",
      "spark.driver.cores": 1,
      "spark.executor.memory": "4G",
      "spark.executor.cores": 1,
      "ae.spark.executor.count": 1
    },
    "env": {
      "SAMPLE_ENV_KEY": "SAMPLE_VALUE"
    }
  }
}

An example of an input payload for a Scala job:

{
  "application_details": {
    "application": "/opt/ibm/spark/examples/jars/spark-examples*.jar",
    "arguments": ["1"],
    "class": "org.apache.spark.examples.SparkPi",
    "conf": {
      "spark.app.name": "MyJob",
      "spark.eventLog.enabled": "true",
      "spark.driver.memory": "4G",
      "spark.driver.cores": 1,
      "spark.executor.memory": "4G",
      "spark.executor.cores": 1,
      "ae.spark.executor.count": 1
    },
    "env": {
      "SAMPLE_ENV_KEY": "SAMPLE_VALUE"
    }
  }
}

If your job is submitted successfully, the following response is returned:

{
  "application_id": "<application_id>",
  "state": "ACCEPTED"
}

Hint:

  • Save the returned value of "application_id" to get the status of the job or to stop the job, as illustrated in the sketch after this list.
  • Save the returned value of "spark_application_id" to monitor and analyze the Spark application on the Spark history server.
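
As a rough illustration of how the saved "application_id" might be used, the following sketch fetches the job state and then stops the job. It assumes that the status URL is the jobs endpoint with the application ID appended; confirm the exact paths and methods in Spark job status.

# Sketch only: check the status of a submitted job and then stop it.
# Assumes the status URL is <V4_JOBS_API_ENDPOINT>/<application_id>;
# see "Spark job status" for the documented endpoints.
import os

import requests

endpoint = os.environ["JOBS_API_ENDPOINT"]   # <V4_JOBS_API_ENDPOINT>
token = os.environ["ACCESS_TOKEN"]           # <ACCESS_TOKEN>
application_id = "<application_id>"          # value returned by the POST request

headers = {"Authorization": f"Bearer {token}"}
status_url = f"{endpoint}/{application_id}"  # assumed URL pattern

# Fetch the current state of the application.
print(requests.get(status_url, headers=headers, verify=False).json())

# Stop the job when it is no longer needed.
requests.delete(status_url, headers=headers, verify=False)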

Spark jobs API parameters

These are the parameters you can use in the Spark jobs API:

Table 1. Parameters for the Spark jobs API

| Name | Sub-properties | Required/Optional | Type | Description |
| --- | --- | --- | --- | --- |
| application_details | | Required | Object | Specifies the Spark application details. |
| | application | Required | String | Specifies the Spark application file, that is, the file path to the Python, R, or Scala job file. |
| | arguments | Optional | String[] | Specifies the application arguments. |
| | conf | Optional | Key-value JSON object | Specifies the Spark configuration values that override the predefined values. See the section Default Spark configuration parameters and environment variables for the default configuration parameters defined by the Spark service. See Apache Spark configurations for the configuration parameters supported by Apache Spark. |
| | env | Optional | Key-value JSON object | Specifies the Spark environment variables required for the job. See the section Default Spark configuration parameters and environment variables for the default environment variables defined by the Spark service. See Apache Spark environment variables for the environment variables supported by Apache Spark. |
| | class | Optional | String | Specifies the entry point for your Scala application. |
| | driver-java-options | Optional | String | Specifies extra Java options to pass to the driver. |
| | driver-library-path | Optional | String | Specifies extra library path entries to pass to the driver. |
| | driver-class-path | Optional | String | Specifies extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath. |
| | jars | Optional | String | Specifies a comma-separated list of jars to include on the driver and executor classpaths. |
| | packages | Optional | String | Specifies a comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. Searches the local Maven repository, then Maven Central, and finally any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version. |
| | exclude-packages | Optional | String | Specifies a comma-separated list of groupId:artifactId entries to exclude while resolving the dependencies provided in --packages, to avoid dependency conflicts. |
| | repositories | Optional | String | Specifies a comma-separated list of additional remote repositories to search for the Maven coordinates given with --packages. |
| | py-files | Optional | String | Specifies a comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps. |
| volumes | | Optional | List of objects | Specifies the volumes to be mounted other than the Spark instance volume. If volumes are added in the application payload, the conf section in the payload is mandatory (see the sketch after this table). |
| | name | Required | String | Specifies the name of the volume. |
| | source_sub_path | Optional | String | Specifies the source path in the volume to be mounted. The source path must be a relative path. |
| | mount_path | Required | String | Specifies the location where the volume is to be mounted. Note that a few mount paths are prohibited; you are prevented from using them because they can compromise the runtime. |
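
The following sketch shows one way such a payload might be put together when an additional volume is mounted. The volume name, sub-path, mount path, and configuration values are illustrative only, and the placement of the volumes list alongside application_details follows Table 1. The script writes the payload to input.json so that it can be submitted with the cURL command shown earlier.

# Illustrative payload with an additional volume mount (values are examples only).
# Because a volume is added, the conf section is included, as Table 1 requires.
import json

payload = {
    "application_details": {
        "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
        "arguments": ["/mnts/my-data/people.txt"],  # example path inside the mounted volume
        "conf": {
            "spark.app.name": "MyJobWithVolume",
            "spark.driver.memory": "4G",
            "spark.executor.memory": "4G"
        },
        "env": {
            "SAMPLE_ENV_KEY": "SAMPLE_VALUE"
        }
    },
    "volumes": [
        {
            "name": "my-project-volume",   # example volume name
            "source_sub_path": "data",     # must be a relative path
            "mount_path": "/mnts/my-data"  # example mount location
        }
    ]
}

# Write the payload to input.json for use with: curl ... -d @input.json
with open("input.json", "w") as f:
    json.dump(payload, f, indent=2)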

Note that the default memory allocation for both the Spark driver and the Spark executor is only 1 G. Although an extra 1 G of memory is available in each Spark driver and executor pod, that memory is reserved for running the Spark daemon: you cannot use it, and it is not allocated from your instance memory quota. For example, if you define a job with both spark.executor.memory and spark.driver.memory set to 2 G and run 4 executors, the memory allocation is 2 G for the driver plus 4 x 2 G for the executors, a total of 10 G. An instance with a 60 G memory limit can therefore run at most 6 of these jobs concurrently.
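
To repeat this calculation for other settings, a small sketch like the following can help; the numbers mirror the example above.

# Worked example from the text: 2 G driver + 4 executors x 2 G each = 10 G per job.
driver_memory_gb = 2      # spark.driver.memory
executor_memory_gb = 2    # spark.executor.memory
executor_count = 4        # ae.spark.executor.count
instance_quota_gb = 60    # instance memory limit

job_memory_gb = driver_memory_gb + executor_count * executor_memory_gb
max_concurrent_jobs = instance_quota_gb // job_memory_gb

print(f"Memory per job: {job_memory_gb} G")           # 10 G
print(f"Max concurrent jobs: {max_concurrent_jobs}")  # 6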

Response codes

The Spark jobs API returns the following response codes:

Table 2. Spark jobs API response codes

| Return code | Meaning of the return code | Description |
| --- | --- | --- |
| 202 | Job accepted | The Spark job is successfully validated and accepted for submitting the application. |
| 400 | Bad request | Returned when the payload is incorrect, for example, if the payload format is wrong or arguments are missing. |
| 404 | Not found | Returned when the Spark application is submitted for an instance ID that does not exist. |
| 500 | Internal server error | Returned when an internal error prevents the server from processing the request. Try submitting your job again. |
| 503 | Service unavailable | Returned when there are insufficient resources. Possible response: Could not complete the request. Reason - FailedScheduling. |
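
When jobs are submitted from a script, these codes can drive simple retry logic. The following sketch is one possible approach that retries only the transient 500 and 503 responses; the function name and retry intervals are illustrative and not part of the API.

# Sketch: react to the Spark jobs API response codes listed above.
import time

import requests


def submit_with_retry(endpoint, token, payload, retries=3):
    """Submit a Spark job, retrying on transient server errors (illustrative helper)."""
    headers = {"Authorization": f"Bearer {token}"}
    for attempt in range(retries):
        response = requests.post(endpoint, headers=headers, json=payload, verify=False)
        if response.status_code == 202:          # Job accepted
            return response.json()
        if response.status_code in (500, 503):   # transient errors: wait and retry
            time.sleep(10 * (attempt + 1))
            continue
        response.raise_for_status()              # 400 or 404 will not succeed on retry
    raise RuntimeError("Job submission did not succeed after retries")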

Default Spark configuration parameters and environment variables

The following tables show the Spark configuration parameters and environment variables that are commonly used in Analytics Engine powered by Apache Spark and their default values.

The following table lists the Spark configuration parameters and their defaults:

Default Spark configuration parameters

| Spark configuration | Default value |
| --- | --- |
| spark.eventLog.enabled | TRUE |
| spark.executor.extraClassPath | /home/spark/space/assets/data_asset/*:/home/spark/user_home/dbdrivers/*:/cc-home/_global_/dbdrivers/*:/home/spark/shared/user-libs/spark2/*:/home/spark/user_home/dbdrivers/*:/home/spark/shared/user-libs/common/*:/home/spark/shared/user-libs/connectors/*:/opt/ibm/connectors/parquet-encryption/*:/opt/ibm/third-party/libs/spark2/*:/opt/ibm/third-party/libs/common/*:/opt/ibm/third-party/libs/connectors/*:/opt/ibm/spark/external-jars/* |
| spark.executor.memory | 1 G |
| spark.executor.cores | 1 |
| (custom) ae.spark.executor.count | 1 |
| (custom) ae.spark.application.priority | 1 |
| spark.driver.extraClassPath | /home/spark/space/assets/data_asset/*:/home/spark/user_home/dbdrivers/*:/cc-home/_global_/dbdrivers/*:/home/spark/shared/user-libs/spark2/*:/home/spark/user_home/dbdrivers/*:/home/spark/shared/user-libs/common/*:/home/spark/shared/user-libs/connectors/*:/opt/ibm/connectors/parquet-encryption/*:/opt/ibm/third-party/libs/spark2/*:/opt/ibm/third-party/libs/common/*:/opt/ibm/third-party/libs/connectors/*:/opt/ibm/spark/external-jars/* |
| spark.driver.memory | 1024 M |
| spark.driver.cores | 1 |
| spark.local.dir | /tmp/spark/scratch (see the spark.local.dir configuration parameter for details) |
| spark.master.ui.port | 8080 |
| spark.worker.ui.port | 8081 |
| spark.ui.port | 4040 |
| spark.history.ui.port | 18080 |
| spark.ui.enabled | TRUE |
| spark.ui.killEnabled | FALSE |
| spark.eventLog.dir | file:///home/spark/spark-events |
| spark.ui.reverseProxy | TRUE |
| spark.ui.showConsoleProgress | TRUE |
| spark.shuffle.service.port | 7337 |
| spark.r.command | /opt/ibm/conda/R/bin/Rscript |
| spark.hadoop.fs.s3a.fast.upload | TRUE |
| spark.hadoop.fs.s3a.multipart.size | 33554432 |
| spark.hadoop.fs.stocator.scheme.list | cos |
| spark.hadoop.fs.stocator.cos.scheme | cos |
| spark.hadoop.fs.stocator.glob.bracket.support | TRUE |
| spark.hadoop.fs.stocator.cos.impl | com.ibm.stocator.fs.cos.COSAPIClient |
| spark.hadoop.fs.cos.impl | com.ibm.stocator.fs.ObjectStoreFileSystem |
| spark.hadoop.fs.s3a.impl | org.apache.hadoop.fs.s3a.S3AFileSystem |
| spark.authenticate | FALSE |
| spark.network.crypto.enabled | FALSE |
| spark.network.crypto.keyLength | 256 |

The following table lists the environment variables and their defaults:

Default Spark environment variables

| Environment variable | Default value |
| --- | --- |
| SPARK_DIST_CLASSPATH | /home/spark/space/assets/data_asset/*:/home/spark/user_home/dbdrivers/*:/cc-home/_global_/dbdrivers/*:/opt/ibm/connectors/idax/*:/opt/ibm/connectors/cloudant/*:/opt/ibm/connectors/db2/*:/opt/ibm/connectors/others-db-drivers/*:/opt/ibm/connectors/wdp-connector-driver/*:/opt/ibm/connectors/wdp-connector-jdbc-library/*:/opt/ibm/connectors/stocator/*:/opt/ibm/connectors/s3/*:/opt/ibm/image-libs/common/*:/opt/ibm/image-libs/spark2/*:/opt/ibm/third-party/libs/batch/*:/opt/ibm/spark/external-jars/* |
| SPARK_LOCAL_DIRS | /tmp/spark/scratch |
| SPARK_MASTER_WEBUI_PORT | 8080 |
| SPARK_MASTER_PORT | 7077 |
| SPARK_WORKER_WEBUI_PORT | 8081 |
| CLASSPATH | /home/spark/user_home/dbdrivers/*:/opt/ibm/connectors/idax/*:/opt/ibm/connectors/cloudant/*:/opt/ibm/connectors/db2/*:/opt/ibm/connectors/others-db-drivers/*:/opt/ibm/connectors/wdp-connector-driver/*:/opt/ibm/connectors/wdp-connector-jdbc-library/*:/opt/ibm/connectors/stocator/*:/opt/ibm/connectors/s3/*:/opt/ibm/image-libs/common/*:/opt/ibm/image-libs/spark2/*:/opt/ibm/third-party/libs/batch/* |
| LD_LIBRARY_PATH | /opt/ibm/connectors/dsdriver/dsdriver/lib:/opt/ibm/connectors/others-db-drivers/oracle/lib:/opt/ibm/jdk/jre/lib/architecture/server:/opt/ibm/jdk/jre/lib/architecture/:/usr/local/lib:/lib64 |
| RUNTIME_PYTHON_ENV | python310 |
| PYTHONPATH | /home/spark/space/assets/data_asset:/home/spark/user_home/python-3:/cc-home/_global_/python-3:/home/spark/shared/user-libs/python:/home/spark/shared/conda/envs/python/lib/python/site-packages:/opt/ibm/conda/miniconda/lib/python/site-packages:/opt/ibm/third-party/libs/python3:/opt/ibm/image-libs/python3:/opt/ibm/image-libs/spark2/xskipper-core.jar:/opt/ibm/image-libs/spark2/spark-extensions.jar:/opt/ibm/image-libs/spark2/metaindexmanager.jar:/opt/ibm/image-libs/spark2/stmetaindexplugin.jar:/opt/ibm/spark/python:/opt/ibm/spark/python/lib/py4j-0.10.7-src.zip |
| R_LIBS_USER | /home/spark/space/assets/data_asset:/home/spark/shared/user-libs/R:/opt/ibm/third-party/libs/R:/opt/ibm/conda/R/lib64/R/library/:/opt/ibm/spark/R/lib:/opt/ibm/image-libs/R |

Parent topic: Submitting Spark jobs