API syntax for submitting Spark runtime
Applies to:
Spark engine
Apache Gluten accelerated Spark engine
- Required permissions: To submit a Spark runtime, you must have the User role.
You typically submit a Spark runtime with a curl command.
The syntax of the curl command is:
curl -k -X POST --url https://<cpd_host_name>/lakehouse/api/<api_version>/spark_engines/<spark_engine_id>/applications \
  -H "Authorization: ZenApiKey <TOKEN>" \
  -d @input.json
Replace the variables as follows:
- <cpd_host_name>: The hostname of your IBM Software Hub.
- <api_version>: The API version. When using the v2 API, set the <api_version> parameter to v2; for the v3 API, set it to v3.
- <spark_engine_id>: The Engine ID of the native Spark engine.
- <TOKEN>: The bearer token. For more information about generating the token, see Generating a bearer token.
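The same request can be assembled in Python. The following is a minimal sketch, not a definitive client: the function name build_submit_request and the host, engine ID, and token values are placeholders invented for illustration, and the Content-Type header is an assumption because the curl syntax above does not set one. The sketch only builds the request; it does not send it.

```python
import json
import urllib.request


def build_submit_request(cpd_host_name, api_version, spark_engine_id, token, payload):
    """Build (but do not send) the POST request described by the curl syntax."""
    url = (
        f"https://{cpd_host_name}/lakehouse/api/{api_version}"
        f"/spark_engines/{spark_engine_id}/applications"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),  # same role as -d @input.json
        headers={
            "Authorization": f"ZenApiKey {token}",
            # Assumption: the documented curl command sets no Content-Type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only.
request = build_submit_request(
    "cpd.example.com",
    "v3",
    "spark123",
    "my-token",
    {"application_details": {"application": "/my-app/job.py"}},
)
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen. The -k option of curl skips TLS certificate verification; this sketch does not reproduce that behavior.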
Note:
The POST method returns after the initial validation of the application. The runtime request is processed asynchronously; first the Spark context is created and then the application is executed. The current status of the application can be fetched by using the GET method.
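Because the request is processed asynchronously, a client usually polls for the status after submitting. The following is a rough sketch of such a loop. It deliberately leaves the GET call to a caller-supplied function, fetch_state, because this section does not give the status URL or the list of possible states; the only state it relies on is ACCEPTED. The function name wait_while_accepted and its parameters are hypothetical.

```python
import time


def wait_while_accepted(fetch_state, poll_seconds=5.0, max_polls=60, sleep=time.sleep):
    """Poll until the state is no longer "ACCEPTED" or max_polls is reached.

    fetch_state is a caller-supplied function that performs the GET request
    and returns the current state of the application as a string.
    """
    state = fetch_state()
    polls = 1
    while state == "ACCEPTED" and polls < max_polls:
        sleep(poll_seconds)
        state = fetch_state()
        polls += 1
    return state


# Simulated status sequence. "NEXT_STATE" is a placeholder, not a documented state.
simulated = iter(["ACCEPTED", "ACCEPTED", "NEXT_STATE"])
final_state = wait_while_accepted(lambda: next(simulated), sleep=lambda seconds: None)
print(final_state)
```

Passing the sleep function as a parameter keeps the loop easy to exercise without real waiting.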
An example of an input payload for a Python runtime:
{"application_details":{"application":"/opt/ibm/spark/examples/src/main/python/wordcount.py","arguments":["/opt/ibm/spark/examples/src/main/resources/people.txt"],"conf":{"spark.app.name":"MyJob","spark.eventLog.enabled":"true","spark.driver.memory":"4G","spark.driver.cores":1,"spark.executor.memory":"4G","spark.executor.cores":1,"ae.spark.executor.count":1},"env":{"SAMPLE_ENV_KEY":"SAMPLE_VALUE"}}}
An example of an input payload for a Scala runtime:
{"application_details":{"application":"/opt/ibm/spark/examples/jars/spark-examples*.jar","arguments":["1"],"class":"org.apache.spark.examples.SparkPi","conf":{"spark.app.name":"MyJob","spark.eventLog.enabled":"true","spark.driver.memory":"4G","spark.driver.cores":1,"spark.executor.memory":"4G","spark.executor.cores":1,"ae.spark.executor.count":1},"env":{"SAMPLE_ENV_KEY":"SAMPLE_VALUE"}}}
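The two payloads differ mainly in the class property, which only the Scala example sets. As an illustration, the following sketch builds either payload from Python and serializes it; build_payload and main_class are names invented for this example, and the conf values are a subset of those shown above.

```python
import json


def build_payload(application, arguments=None, main_class=None, conf=None, env=None):
    """Assemble an application_details payload, omitting unset optional fields."""
    details = {"application": application}
    if arguments is not None:
        details["arguments"] = list(arguments)
    if main_class is not None:
        details["class"] = main_class  # entry point for a Scala application
    if conf is not None:
        details["conf"] = dict(conf)
    if env is not None:
        details["env"] = dict(env)
    return {"application_details": details}


scala_payload = build_payload(
    "/opt/ibm/spark/examples/jars/spark-examples*.jar",
    arguments=["1"],
    main_class="org.apache.spark.examples.SparkPi",
    conf={"spark.app.name": "MyJob", "spark.driver.memory": "4G"},
)

# Serialized form; this is what would be saved as input.json for curl.
input_json = json.dumps(scala_payload, indent=2)
print(input_json)
```

Saving input_json to a file named input.json makes it usable with the -d @input.json option of the curl command.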
The returned response if your runtime was successfully submitted:
{
"application_id": "<application_id>",
"state": "ACCEPTED"
}
Hint:
- Save the returned value of "application_id" to get the status of the runtime or to stop the runtime.
- Save the returned value of "application_id" to monitor and analyze the Spark application on the Spark history server.
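One way to keep the application ID is to parse it from the response body as soon as the submission returns. The sketch below assumes the body has exactly the shape shown above; parse_submit_response is a hypothetical helper and the ID value is made up.

```python
import json


def parse_submit_response(body):
    """Return (application_id, state) from the JSON body of a submission response."""
    data = json.loads(body)
    return data["application_id"], data["state"]


# Sample body with a made-up ID in place of <application_id>.
application_id, state = parse_submit_response(
    '{"application_id": "example-id-123", "state": "ACCEPTED"}'
)
print(application_id, state)
```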
Customizing Spark runtime configurations
Using a custom Spark version
An example of an input payload for changing the Spark runtime version:
{"application_details":{"application":"/opt/ibm/spark/examples/src/main/python/wordcount.py","arguments":["/opt/ibm/spark/examples/src/main/resources/people.txt"],"runtime":{"spark_version":"3.4"}}}
Using custom packages
An example of an input payload for using custom packages:
{"volumes":[{"name":"cpd-instance::myapp-vol","mount_path":"/my-app"}],"application_details":{"application":"/my-app/python-spark-pi.py","packages":"org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0,org.apache.spark:spark-streaming-kafka-0-10_2.12:3.4.0","conf":{"spark.app.name":"MyJob","spark.eventLog.enabled":"true","spark.driver.cores":4,"spark.driver.memory":"8G","spark.executor.memory":"2G","spark.executor.cores":4,"ae.spark.driver.log.level":"ERROR","ae.spark.executor.log.level":"WARN"}}}
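Two parts of this payload are easy to get wrong: the packages string, whose entries use the groupId:artifactId:version format, and the volumes entries, which are subject to the mount path restrictions listed in the parameters table in this document. The following sketch checks both before submission. The helper names are invented, and matching prohibited paths exactly after normalization is an assumption; the service may apply stricter rules.

```python
import posixpath

# Prohibited mount paths, copied from the parameters table in this document.
PROHIBITED_MOUNT_PATHS = {
    "/", "/bin", "/boot", "/dev", "/etc", "/home", "/lib", "/lib64",
    "/licenses", "/lost+found", "/media", "/mnt", "/opt", "/proc", "/root",
    "/run", "/sbin", "/space_data", "/project_data", "/srv", "/sys", "/tmp",
    "/usr", "/var", "/home/spark/shared", "/home/spark/spark-events",
    "/home/spark/space/assets", "/home/spark/project/assets",
}


def join_packages(coordinates):
    """Join Maven coordinates into the comma-separated "packages" string."""
    for coordinate in coordinates:
        parts = coordinate.split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected groupId:artifactId:version, got {coordinate!r}")
    return ",".join(coordinates)


def check_volume(volume):
    """Return a list of problems found in one entry of "volumes"."""
    problems = []
    for key in ("name", "mount_path"):  # both are marked Required in the table
        if not volume.get(key):
            problems.append(f"missing required property: {key}")
    mount_path = volume.get("mount_path")
    # Assumption: a path is rejected only on an exact match after normalization.
    if mount_path and posixpath.normpath(mount_path) in PROHIBITED_MOUNT_PATHS:
        problems.append(f"prohibited mount path: {mount_path}")
    sub_path = volume.get("source_sub_path")
    if sub_path and posixpath.isabs(sub_path):
        problems.append("source_sub_path must be a relative path")
    return problems


packages = join_packages([
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0",
    "org.apache.spark:spark-streaming-kafka-0-10_2.12:3.4.0",
])
print(packages)
print(check_volume({"name": "cpd-instance::myapp-vol", "mount_path": "/my-app"}))
```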
Spark runtime API parameters
These are the parameters you can use in the Spark runtimes API. In the example payloads above, properties such as application, arguments, class, conf, env, packages, and runtime are nested inside application_details, and name and mount_path belong to each entry of volumes.
| Name | Required/Optional | Type | Description |
|---|---|---|---|
| application_details | Required | Object | Specifies the Spark application details. |
| application | Required | String | Specifies the Spark application file, that is, the file path to the Python or Scala runtime file. |
| arguments | Optional | String[] | Specifies the application arguments. |
| conf | Optional | Key-value JSON object | Specifies the Spark configuration values that override the predefined values. See Apache Spark configurations for the configuration parameters supported by Apache Spark. |
| env | Optional | Key-value JSON object | Specifies Spark environment variables required for the runtime. See Apache Spark environment variables for the environment variables supported by Apache Spark. |
| class | Optional | String | Specifies the entry point for your Scala application. |
| driver-java-options | Optional | String | Specifies extra Java options to pass to the driver. |
| driver-library-path | Optional | String | Specifies extra library path entries to pass to the driver. |
| driver-class-path | Optional | String | Specifies extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath. |
| jars | Optional | String | Specifies a comma-separated list of jars to include on the driver and executor classpaths. |
| packages | Optional | String | Specifies a comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. Searches the local Maven repository, then Maven central, and finally any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version. |
| exclude-packages | Optional | String | Specifies a comma-separated list of groupId:artifactId to exclude while resolving the dependencies provided in --packages to avoid dependency conflicts. |
| repositories | Optional | String | Specifies a comma-separated list of additional remote repositories to search for the Maven coordinates given with --packages. |
| py-files | Optional | String | Specifies a comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps. |
| runtime.spark_version | Optional | String | Specifies the Spark runtime version to be used for the runtime. IBM Cloud Pak for Data supports Spark 3.4. |
| idempotency_key | Optional | String | Specifies a key to ensure that repeated requests with the same key are treated as a single request. |
| timeout_in_seconds | Optional | String | Specifies the maximum execution time while submitting an application. |
| max_retries | Optional | String | Specifies the maximum number of times the application is retried after a failure. If you do not specify any value, the application runs exactly once. |
| min_retry_interval_in_seconds | Optional | String | Specifies the retry interval, in seconds, measured between the start of the failed run and the subsequent retry run. |
| volumes | Optional | List of objects | Specifies the volumes to be mounted other than the Spark engine volume. If volumes are added in the application payload, then the conf section in the payload is mandatory. |
| name | Required | String | Specifies the name of the volume. |
| source_sub_path | Optional | String | Specifies the source path in the volume to be mounted. The source path must be a relative path. |
| mount_path | Required | String | Specifies the location where the volume is to be mounted. The following mount paths are prohibited because they can compromise the runtime: [/, /bin, /boot, /dev, /etc, /home, /lib, /lib64, /licenses, /lost+found, /media, /mnt, /opt, /proc, /root, /run, /sbin, /space_data, /project_data, /srv, /sys, /tmp, /usr, /var, /home/spark/shared, /home/spark/spark-events, /home/spark/space/assets, /home/spark/project/assets] |

The Spark runtimes API returns the following response codes:
| Return code | Meaning of the return code | Description |
|---|---|---|
| 202 | Runtime accepted | The Spark runtime is successfully validated and accepted for submitting the application. |
| 400 | Bad request | This is returned when the payload is incorrect, for example, if the payload format is incorrect or arguments are missing. |
| 404 | Not Found | This is returned when the Spark application is submitted for an instance ID that does not exist. |
| 500 | Internal server error | This is returned when the server cannot process the request. Try submitting your runtime again. |
| 503 | Service unavailable | This is returned when there are insufficient resources. Possible response: Could not complete the request. Reason - FailedScheduling. |
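
A client can branch on these return codes after each submission. The following is a small sketch of such a lookup; describe_status is a hypothetical helper. Treating 500 as retryable follows the advice in the table, while treating 503 as retryable is an assumption based on the insufficient-resources description.

```python
# Meanings taken from the return code table above.
RETURN_CODES = {
    202: "Runtime accepted",
    400: "Bad request",
    404: "Not Found",
    500: "Internal server error",
    503: "Service unavailable",
}

# 500: the table advises submitting again.
# 503: assumption, resources may become available later.
RETRYABLE = {500, 503}


def describe_status(status_code):
    """Return (meaning, should_retry) for a return code of the Spark runtimes API."""
    meaning = RETURN_CODES.get(status_code, "Undocumented return code")
    return meaning, status_code in RETRYABLE


for code in (202, 400, 503):
    print(code, describe_status(code))
```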