Submitting Spark jobs via API
In IBM Cloud Pak for Data, you can run Spark jobs or applications on your IBM Cloud Pak for Data cluster without installing Watson Studio by using the Spark jobs REST API of Analytics Engine powered by Apache Spark.
This service is not available by default. An administrator must install it on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether it is enabled.
Submitting Spark jobs
You can submit any Spark application that runs Spark SQL or data transformation, data science, and machine learning jobs by using the Spark jobs REST API. Each submitted job runs in a dedicated cluster. Any configuration settings that you pass through the jobs API override the default configurations.
To submit a Spark job:
- Get the service endpoint to the Spark jobs API:
    - From the Navigation menu on the IBM Cloud Pak for Data web user interface, click Services > Instances, find the instance and click it to view the instance details.
    - Under Access information, copy and save the Spark jobs endpoint.
    - Generate a token if you haven't already done so. See Generate an access token.
- Submit the job by using the endpoint and the access token that you generated. The job that you want to submit can be located in the file storage system on the IBM Cloud Pak for Data cluster or in IBM Cloud Object Storage. See Storage considerations.
The following examples show how to submit applications to the Spark instance. See Spark jobs API syntax for a description of the job submission syntax, the parameters that you can use, and the returned error codes.
The first example includes only the mandatory parameters and submits a word count application:
curl -k -X POST <JOB_API_ENDPOINT> \
  -H "Authorization: Bearer <ACCESS_TOKEN>" \
  -d '{
    "engine": {"type": "spark"},
    "application_arguments": ["/opt/ibm/spark/examples/src/main/resources/people.txt"],
    "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py"
  }'
The second example customizes the cluster hardware sizes and submits a Java application by specifying application_jar and main_class:
curl -k -X POST <JOB_API_ENDPOINT> \
  -H "Authorization: Bearer <ACCESS_TOKEN>" \
  -d '{
    "engine": {
      "type": "spark",
      "template_id": "spark-2.4.0-jaas-v2-cp4d-template",
      "conf": {"spark.app.name": "myJob"},
      "size": {
        "num_workers": "1",
        "worker_size": {"cpu": 1, "memory": "1g"},
        "driver_size": {"cpu": 1, "memory": "1g"}
      }
    },
    "application_arguments": ["1"],
    "application_jar": "/opt/ibm/spark/examples/jars/spark-examples*.jar",
    "main_class": "org.apache.spark.examples.SparkPi"
  }'
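The request body in the second example can also be assembled programmatically before it is posted. The following Python sketch is illustrative only: the build_submit_payload helper and its parameter defaults are this sketch's own invention; only the payload fields come from the examples above.

```python
import json

# Illustrative helper (not part of the Spark jobs API): assemble the JSON
# body of the second curl example, with the cluster sizing made tunable.
def build_submit_payload(application_jar, main_class, args,
                         num_workers=1, cpu=1, memory="1g"):
    """Build a Spark jobs API submit body with custom cluster sizing."""
    return {
        "engine": {
            "type": "spark",
            "template_id": "spark-2.4.0-jaas-v2-cp4d-template",
            "conf": {"spark.app.name": "myJob"},
            "size": {
                "num_workers": str(num_workers),
                "worker_size": {"cpu": cpu, "memory": memory},
                "driver_size": {"cpu": cpu, "memory": memory},
            },
        },
        "application_arguments": args,
        "application_jar": application_jar,
        "main_class": main_class,
    }

payload = build_submit_payload(
    "/opt/ibm/spark/examples/jars/spark-examples*.jar",
    "org.apache.spark.examples.SparkPi",
    ["1"],
)
# Serialize and pass as the -d body of the POST to <JOB_API_ENDPOINT>.
print(json.dumps(payload, indent=2))
```

Serializing a dict this way avoids hand-editing the quoting inside a long single-line `-d` argument.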
Viewing Spark job status
After you have submitted your Spark job, you can view the job details, including the run status, in the following ways:

- In the deployment space where your job ran:
    - From the Navigation menu on the IBM Cloud Pak for Data web user interface, click Services > Instances, find the instance and click it to view the instance details.
    - Copy the deployment space URL into a new browser window. This opens the deployment space on the Jobs tab, where you can view the Spark jobs.
    - Click the job to see the job runs.
    - Click a job run to view the run details and log tail. You can download the complete log for the run by clicking Download log.

    Note that all jobs submitted to the instance are listed. When a job stops, all of its cluster resources are released.
- By using the Spark jobs API. For example:
    - To list all active jobs, use the following cURL command:

        curl -k -X GET <JOB_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>"

      Note that jobs in FINISHED state are removed and don't show up in the list of all jobs.
    - To get the status of a job, use:

        curl -k -X GET <JOB_API_ENDPOINT>/<job_id> -H "Authorization: Bearer <ACCESS_TOKEN>"

      Example response:

        { "jobId": "JOB_ID", "job_state": "FINISHED" }
    - To get the driver state, use:

        curl -k -X GET "<JOB_API_ENDPOINT>/<job_id>?driver_state=true" -H "Authorization: Bearer <ACCESS_TOKEN>"

      Example response:

        { "jobId": "JOB_ID", "job_state": "FINISHED", "driver_state": "FINISHED" }
- Through the Spark history server. Analyze how your Spark job performed by viewing the performance metrics, partitions, and execution plans of completed jobs in the Spark history server UI. See Accessing the Spark history server.
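The status responses returned by the API are plain JSON, so they are straightforward to consume from a script. A minimal Python sketch follows; note the assumption that FAILED and STOPPED are additional terminal states, since only FINISHED appears in the example responses above.

```python
import json

# Assumed terminal states: only FINISHED is shown in the example
# responses; FAILED and STOPPED are this sketch's assumptions.
TERMINAL_STATES = {"FINISHED", "FAILED", "STOPPED"}

def parse_job_status(response_body):
    """Extract the job state (and the driver state, present only when the
    request was made with driver_state=true) from a status response."""
    doc = json.loads(response_body)
    return doc["job_state"], doc.get("driver_state")

def is_terminal(job_state):
    """True once the job has stopped and its cluster resources are freed."""
    return job_state in TERMINAL_STATES

body = '{ "jobId": "JOB_ID", "job_state": "FINISHED", "driver_state": "FINISHED" }'
job_state, driver_state = parse_job_status(body)
print(job_state, driver_state, is_terminal(job_state))  # FINISHED FINISHED True
```

A polling loop would call the status endpoint, feed the body to parse_job_status, and stop when is_terminal returns True.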
Deleting Spark jobs
You can delete a Spark job in the following ways:

- In the deployment space where your job ran:
    - From the Navigation menu on the IBM Cloud Pak for Data web user interface, click Services > Instances, find the instance and click it to view the instance details.
    - Copy the deployment space URL into a new browser window. This opens the deployment space on the Jobs tab, where you can view the Spark jobs.
    - Find your job, click it to open the job's details page, and cancel the job run.
- By using the Spark jobs API. To delete a Spark job, use:

        curl -k -X DELETE <JOB_API_ENDPOINT>/<job_id> -H "Authorization: Bearer <ACCESS_TOKEN>"

  The API returns 204 No Content if the job was successfully deleted.
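As a sketch of the same call from Python's standard library: the helpers below only prepare the DELETE request and interpret the status code. build_delete_request, deletion_succeeded, and the example endpoint URL are illustrative, not part of the jobs API.

```python
from urllib import request

def build_delete_request(endpoint, job_id, token):
    """Prepare a DELETE request for <JOB_API_ENDPOINT>/<job_id> with the
    bearer-token Authorization header."""
    return request.Request(
        f"{endpoint}/{job_id}",
        method="DELETE",
        headers={"Authorization": f"Bearer {token}"},
    )

def deletion_succeeded(status_code):
    """The jobs API returns 204 No Content on successful deletion."""
    return status_code == 204

# Placeholder endpoint, job ID, and token for illustration only.
req = build_delete_request("https://example.com/v2/jobs", "JOB_ID", "ACCESS_TOKEN")
print(req.get_method(), req.full_url)  # DELETE https://example.com/v2/jobs/JOB_ID
```

Sending the request with urllib.request.urlopen(req) and passing the response status to deletion_succeeded completes the check.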