Submitting Spark jobs via API

You can run Spark jobs or applications on your IBM Cloud Pak for Data cluster without installing Watson Studio by using the Spark jobs REST APIs of Analytics Engine powered by Apache Spark.

Service: This service is not available by default. An administrator must install this service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

A new V3 version of the Spark jobs REST APIs is available. It is based on the open source Apache Spark API and offers a richer set of functionality. The older V2 API version is deprecated. Although you can still use the V2 APIs, you should start using the V3 API in your applications. For details about the V2 API, see the IBM Cloud Pak for Data 3.5 documentation.

Submitting Spark jobs

You can use the Spark jobs REST API to submit any Spark application that runs Spark SQL or performs data transformation, data science, or machine learning tasks. Each submitted job runs in a dedicated cluster. Any configuration settings that you pass through the jobs API override the default configurations.
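
For example, you can pass Spark configuration properties in the job payload to override the defaults for a single run. The following is a minimal sketch only: the conf object inside application_details and the property values shown here are assumptions for illustration, so check Spark jobs API syntax for the exact fields that your release supports. The full submission procedure follows below.

    curl -k -X POST <V3_JOBS_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>" -d '{
        "application_details": {
            "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
            "application_arguments": ["/opt/ibm/spark/examples/src/main/resources/people.txt"],
            "conf": {
                "spark.app.name": "wordcount-with-overrides",
                "spark.eventLog.enabled": "true"
            }
        }
    }'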

To submit a Spark job:

  1. Get the service endpoint to the Spark job API:

    1. From the Navigation menu on the IBM Cloud Pak for Data web user interface, click Services > Instances, find the instance and click it to view the instance details.
    2. Under Access information, copy and save the Spark jobs endpoint.
  2. Generate a token if you haven’t already done so. See Generate an access token. A scripted sketch that chains this step with the job submission follows the examples below.
  3. Submit the job using the endpoint and access token that you generated. The job that you want to submit can be located in the file storage system on the IBM Cloud Pak for Data cluster or in IBM Cloud Object Storage. See Storage considerations.

    The following examples show how to submit sample applications to the Spark instance. See Spark jobs API syntax for a description of the submit Spark jobs syntax, the parameters you can use, and the returned error codes.

    The first example includes only the mandatory parameters:

     curl -k -X POST <V3_JOBS_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>" -d '{
         "application_details": {
             "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
             "application_arguments": ["/opt/ibm/spark/examples/src/main/resources/people.txt"]
            }
     }'
    

    The second example builds on the first and shows how to customize the cluster hardware sizes:

     curl -k -X POST <V3_JOBS_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>" -d '{
         "template_id": "<template_id>",
         "application_details": {
             "application": "/opt/ibm/spark/examples/jars/spark-examples*.jar",
             "application_arguments": ["1"],
             "class": "org.apache.spark.examples.SparkPi",
             "driver-memory": "4G",
             "driver-cores": 1,
             "executor-memory": "4G",
             "executor-cores": 1,
             "num-executors": 1
            }
     }'
    
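
If you script the submission, steps 2 and 3 can be chained in a small shell snippet. This is a minimal sketch rather than the documented procedure: it assumes the platform authorization endpoint described under Generate an access token and that the jq utility is available for parsing JSON; replace the host, credential, and endpoint placeholders with your own values.

    # Minimal sketch: obtain an access token, then submit the word count sample job.
    # Assumes jq is installed; all <...> placeholders must be replaced.
    CPD_HOST="https://<CloudPakforData_URL>"
    JOBS_API="<V3_JOBS_API_ENDPOINT>"

    # Step 2: generate an access token (see Generate an access token for the documented call)
    TOKEN=$(curl -k -s -X POST "$CPD_HOST/icp4d-api/v1/authorize" \
        -H "Content-Type: application/json" \
        -d '{"username": "<username>", "password": "<password>"}' | jq -r '.token')

    # Step 3: submit the job and print the response returned by the Spark jobs API
    curl -k -s -X POST "$JOBS_API" \
        -H "Authorization: Bearer $TOKEN" \
        -d '{
              "application_details": {
                  "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
                  "application_arguments": ["/opt/ibm/spark/examples/src/main/resources/people.txt"]
              }
            }'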

Spark job status

After you have submitted your Spark job, you can view the job details, including the state of the job runs.

The expected run states are:

State      Description
WAITING    Spark job was submitted successfully and is waiting for resources to be allocated. This usually lasts only a fraction of a second before the state changes to RUNNING, because each job is allocated dedicated resources immediately.
RUNNING    Spark job was submitted successfully and is running.
FAILED     Spark job was submitted successfully, but the Spark application execution failed or returned a non-zero exit code.
FINISHED   Spark job was submitted successfully and the Spark application execution completed successfully with a zero exit code.
UNKNOWN    Spark job submission was successful, but an error occurred while getting the state of the application.

You can view the status:

  • By using the Spark Jobs API:

    • List all active jobs:

        curl -k -X GET <V3_JOBS_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>"
      

      Note that jobs in FINISHED state are removed and don’t show up in the list of all jobs.

    • Get the status of a job (a polling sketch follows this list):

        curl -k -X GET <V3_JOBS_API_ENDPOINT>/<job_id> -H "Authorization: Bearer <ACCESS_TOKEN>"
      

      Example response:

        {
            "application_id": "<application_id>",
            "state": "RUNNING",
            "start_time": "Monday 07 June 2021 14:46:23.237+0000",
            "spark_application_id": "app-20210607144623-0000"
        }
      
  • In the jobs UI, if the Spark advanced features are enabled. See Using advanced features:

    • From the Navigation menu on the IBM Cloud Pak for Data web user interface, click Services > Instances, find the instance and click it to view the instance details.
    • Copy the deployment space URL and open it in a new browser window. The deployment space opens on the Jobs tab, where you can view the Spark jobs.
    • Click the job to see the job runs.
    • Check the Spark application ID, job ID, and the status and duration of jobs from the UI.
    • Click a job run to view the run details and log tail. You can download the complete log for the run by clicking Download log.

      Note that all jobs submitted to the instance are listed. When the job stops, all cluster resources are released.

  • In the Spark history server UI:

    Analyze how your Spark job performed by viewing the performance metrics, partitions, and execution plans of completed jobs in the Spark history server UI. See Accessing the Spark history server.
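
If you check the status from a script, you typically poll the status endpoint until the run leaves the WAITING and RUNNING states. The following is a minimal sketch that reuses the status call shown above; it assumes jq is available and that JOB_ID and ACCESS_TOKEN were captured when the job was submitted. Because finished jobs are removed from the job list, the loop also stops when no state can be read anymore.

    # Minimal polling sketch (assumes jq is installed; JOB_ID and ACCESS_TOKEN are set)
    while true; do
        STATE=$(curl -k -s -X GET "<V3_JOBS_API_ENDPOINT>/$JOB_ID" \
            -H "Authorization: Bearer $ACCESS_TOKEN" | jq -r '.state')
        echo "Current state: ${STATE:-unknown}"
        # Stop when the run has ended or when the job can no longer be found
        if [ -z "$STATE" ] || [ "$STATE" = "null" ] || \
           [ "$STATE" = "FINISHED" ] || [ "$STATE" = "FAILED" ] || [ "$STATE" = "UNKNOWN" ]; then
            break
        fi
        sleep 10
    done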

Deleting Spark jobs

You can delete a Spark job:

  • By using the Spark Jobs API:

      curl -k -X DELETE <V3_JOBS_API_ENDPOINT>/<job-id> -H "Authorization: Bearer <ACCESS_TOKEN>"
    

    Returns 204 No Content if the job was successfully deleted. A sketch for checking the status code from a script follows this list.

  • In the jobs UI, if the Spark advanced features are enabled. See Using advanced features:

    • From the Navigation menu on the IBM Cloud Pak for Data web user interface, click Services > Instances, find the instance and click it to view the instance details.
    • Copy the deployment space URL and open it in a new browser window. The deployment space opens on the Jobs tab, where you can view the Spark jobs.
    • Find your job, click it to open the job’s details page and cancel the job run.
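
To verify the deletion from a script, you can have curl print only the HTTP status code, because a successful deletion returns an empty body. This is a small sketch that uses the same endpoint and token placeholders as the API example above:

    # Print only the HTTP status code of the DELETE call; 204 indicates success
    curl -k -s -o /dev/null -w "%{http_code}\n" -X DELETE \
        "<V3_JOBS_API_ENDPOINT>/<job-id>" \
        -H "Authorization: Bearer <ACCESS_TOKEN>"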

Learn more