Submitting Spark jobs via API

In IBM Cloud Pak for Data, you can run Spark applications on your IBM Cloud Pak for Data cluster without installing Watson Studio by using the Spark jobs REST APIs of IBM Analytics Engine powered by Apache Spark.

Note that the terms Spark application and Spark job are used interchangeably throughout the IBM Analytics Engine powered by Apache Spark documentation.

This service is not available by default. An administrator must install this service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

A new V4 version of the Spark jobs REST APIs is available. It extends the V3 API functionality with asynchronous job submission and other new features. The older V3 API version is deprecated and you are advised to upgrade to the V4 API. If you are still using the V3 API, see the IBM Cloud Pak for Data 4.5 documentation for details.

The V2 API version is also deprecated. Although you can still use the V2 API, you should start using the V4 API in your applications. For details about the V2 API, see the IBM Cloud Pak for Data 3.5 documentation.

Submitting Spark jobs

You can use the Spark jobs REST API to submit any Spark application that runs Spark SQL or data transformation, data science, or machine learning workloads. Each submitted job runs in a dedicated cluster. Any configuration settings that you pass through the jobs API override the default configurations.

To submit a Spark job:

  1. Get the service endpoint to the Spark job API:

    1. From the Navigation menu on the IBM Cloud Pak for Data web user interface, click Services > Instances, find the instance and click it to view the instance details.
    2. Under Access information, copy and save the Spark jobs endpoint.
  2. Generate a token if you haven't already done so. See Generating an access token.

  3. Submit the job using the endpoint and access token that you generated. The job that you want to submit can be located in the file storage system on the IBM Cloud Pak for Data cluster or in IBM Cloud Object Storage. See Storage considerations.

    Note: Spark applications must return an exit code in all scenarios: non-zero in case of a failure or exception, and zero in case of successful execution, as illustrated in the sketch that follows. If the application creates a SparkContext, the POST method returns when the SparkSession is created. If the application does not create a SparkContext, the POST method blocks until the application terminates.
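
    The following minimal PySpark sketch illustrates this exit-code convention. It is loosely modeled on the wordcount example that is used in the submission examples below; the argument handling and the explicit sys.exit calls are illustrative and are not taken from the shipped wordcount.py:

      import sys
      from pyspark.sql import SparkSession

      def main():
          # The input path is passed through "arguments" in the job payload,
          # for example /opt/ibm/spark/examples/src/main/resources/people.txt.
          if len(sys.argv) < 2:
              print("Usage: wordcount <input-path>", file=sys.stderr)
              sys.exit(1)  # non-zero exit code: invalid invocation

          spark = SparkSession.builder.appName("WordCountWithExitCodes").getOrCreate()
          try:
              lines = spark.read.text(sys.argv[1])
              counts = (lines.rdd.flatMap(lambda row: row.value.split())
                        .map(lambda word: (word, 1))
                        .reduceByKey(lambda a, b: a + b))
              for word, count in counts.collect():
                  print(word, count)
          except Exception as err:
              print("Application failed: {}".format(err), file=sys.stderr)
              spark.stop()
              sys.exit(1)  # non-zero exit code on failure or exception
          spark.stop()
          sys.exit(0)  # zero exit code on successful execution

      if __name__ == "__main__":
          main()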

The following examples show how to submit a word count application to the Spark instance. See Spark jobs API syntax for a description of the submit Spark jobs syntax, the parameters you can use and the returned error codes.

The first example includes only the mandatory parameters:

curl -k -X POST <V4_JOBS_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>" -d '{
    "application_details": {
        "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
        "arguments": ["/opt/ibm/spark/examples/src/main/resources/people.txt"]
    }
}'

The second example builds on the first example and shows how to customize the cluster hardware sizes:

curl -k -X POST <V4_JOBS_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>" -d '{
    "application_details": {
        "application": "/opt/ibm/spark/examples/jars/spark-examples*.jar",
        "arguments": ["1"],
        "class": "org.apache.spark.examples.SparkPi",
        "conf": {
            "spark.driver.memory": "4G",
            "spark.driver.cores": 1,
            "spark.executor.memory": "4G",
            "spark.executor.cores": 1,
            "ae.spark.executor.count": 1
        }
    }
}'
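
If you prefer to submit the job programmatically instead of with curl, the same payload can be posted with any HTTP client. The following Python sketch is an equivalent of the second curl example under the assumption that the requests library is available; it uses the same placeholders and simply prints whatever the API returns rather than assuming a particular response structure:

import json
import requests

# Placeholders: use the Spark jobs endpoint and the access token from steps 1 and 2.
JOBS_API_ENDPOINT = "<V4_JOBS_API_ENDPOINT>"
ACCESS_TOKEN = "<ACCESS_TOKEN>"

payload = {
    "application_details": {
        "application": "/opt/ibm/spark/examples/jars/spark-examples*.jar",
        "arguments": ["1"],
        "class": "org.apache.spark.examples.SparkPi",
        "conf": {
            "spark.driver.memory": "4G",
            "spark.driver.cores": 1,
            "spark.executor.memory": "4G",
            "spark.executor.cores": 1,
            "ae.spark.executor.count": 1
        }
    }
}

response = requests.post(
    JOBS_API_ENDPOINT,
    headers={"Authorization": "Bearer " + ACCESS_TOKEN},
    json=payload,
    verify=False,  # equivalent of curl -k; use a trusted certificate in production
)
print(response.status_code)
print(json.dumps(response.json(), indent=2))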

Spark job status

After you have submitted your Spark job, you can view the job details, including the state of the job runs.

The expected run states are:

State Description
ACCEPTED Spark job was validated successfully and the process of submitting the application has started.
QUEUED Spark job was validated successfully but is queued because of insufficient resource quota or cluster resources.
STARTING Spark job was submitted successfully and is waiting for the Spark application to start.
WAITING Spark job was submitted successfully and is waiting for resources to be allocated. A job usually remains in this state only for a fraction of a second before it changes to RUNNING, because each job is allocated dedicated resources immediately.
RUNNING Spark job was submitted successfully and is running.
FAILED Spark job was submitted successfully, but the Spark application failed or returned a non-zero exit code.
FINISHED Spark job was submitted successfully and the Spark application completed successfully with a zero exit code.
STOPPED Spark job was submitted successfully and the user initiated a DELETE API call to cancel the running job.
UNKNOWN Spark job was submitted successfully, but an error occurred while getting the state of the application.

You can view the status:

  • By using the Spark Jobs API:

    • List all active jobs:

      curl -k -X GET <V4_JOBS_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>"
      

      Note that jobs in FINISHED state are removed and don't show up in the list of all jobs.

    • Get the status of a job (a polling sketch that uses this call follows this list):

      curl -k -X GET <V4_JOBS_API_ENDPOINT>/<job_id> -H "Authorization: Bearer <ACCESS_TOKEN>"
      

      Example response:

      {
          "application_id": "28ce7f71-a357-4583-9de8-6607047ca783",
          "state": "RUNNING",
          "start_time": "Monday' 07 June 2021 '14:46:23.237+0000",
          "spark_application_id": "app-20210607144623-0000"
      }
      
    • List all jobs with certain states:

      The following example lists all jobs that are in the WAITING, RUNNING, FAILED, UNKNOWN, or STOPPED state. The job states that you can query through the API are listed in the table at the beginning of this section.

      curl -k -X GET "<V4_JOBS_API_ENDPOINT>?state=WAITING,RUNNING,FAILED,UNKNOWN,STOPPED" -H "Authorization: Bearer <ACCESS_TOKEN>"
      
  • In the jobs UI if the Spark advanced features are enabled. See Using advanced features:

    • From the Navigation menu on the IBM Cloud Pak for Data web user interface, click Services > Instances, find the instance and click it to view the instance details.

    • Click the open and close list of options icon on the right of the instance details page and select Deployment Space to open the deployment space on the Jobs tab where you can view the Spark jobs.

    • Click the job to see the job runs.

    • Check the Spark application ID, job ID, and the status and duration of jobs from the UI.

    • Click a job run to view the run details and log tail. You can download the complete log for the run by clicking Download log.

      Note that all jobs submitted to the instance are listed. When the job stops, all cluster resources are released.

  • In the Spark history server UI:

    Analyze how your Spark job performed by viewing the performance metrics, partitions, and execution plans of completed jobs in the Spark history server UI. See Accessing the Spark history server.
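
Because a submitted job runs asynchronously, a common pattern is to poll the status call shown above until the job reaches a terminal state. The following Python sketch only illustrates that pattern; it assumes the requests library, the same placeholders as the curl examples, and a job ID obtained at submission time, and it relies on the "state" field shown in the example response:

import time
import requests

JOBS_API_ENDPOINT = "<V4_JOBS_API_ENDPOINT>"
ACCESS_TOKEN = "<ACCESS_TOKEN>"
JOB_ID = "<job_id>"  # the job ID used in the status call above

headers = {"Authorization": "Bearer " + ACCESS_TOKEN}
terminal_states = {"FINISHED", "FAILED", "STOPPED"}

while True:
    response = requests.get(
        JOBS_API_ENDPOINT + "/" + JOB_ID,
        headers=headers,
        verify=False,  # equivalent of curl -k
    )
    state = response.json().get("state", "UNKNOWN")
    print("Current state:", state)
    if state in terminal_states:
        break
    time.sleep(30)  # poll every 30 seconds

Keep in mind that jobs in FINISHED state are removed from the list of all jobs, so production code should also handle a job that can no longer be found.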

Queuing in Spark jobs V4 APIs

Prerequisites:
You must have the Cloud Pak for Data Scheduler installed and enabled in the Analytics Engine custom resource.
To enable the Cloud Pak for Data Scheduler for Analytics Engine, the project administrator needs to specify the spec.serviceConfig.schedulerForQuotaAndQueuing configuration in the service-level custom resource. For information, see Specifying additional configurations for Analytics Engine powered by Apache Spark.

After a Spark application is submitted, the state in the response changes to ACCEPTED. If you create Spark applications with a V4-based resource quota, the application state automatically changes to QUEUED under either of the following conditions:

  • The job exceeds the instance-level quota for CPU and memory
  • The cluster doesn't have enough CPU or memory for running the Spark job

When sufficient resource quota and cluster resources become available, the queued Spark job is scheduled and the application state changes to RUNNING.

If multiple Spark jobs are queued, the jobs are scheduled in the order of their priority: Spark jobs with a higher priority are scheduled first, subject to the same quota and resource conditions. The priority of a Spark job is controlled by the ae.spark.application.priority Spark configuration, which you can set in the job API payload, as shown in the example that follows. For information, see Default Spark configuration parameters and environment variables.
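
For illustration, the priority is passed in the same conf block as the other Spark settings, shown here as a Python dictionary in the style of the submission sketch earlier. The numeric value is an arbitrary example, not a documented default; see Default Spark configuration parameters and environment variables for the supported values:

payload = {
    "application_details": {
        "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
        "arguments": ["/opt/ibm/spark/examples/src/main/resources/people.txt"],
        "conf": {
            "ae.spark.application.priority": 2  # example value only
        }
    }
}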

Stopping Spark jobs

You can stop a Spark job:

  • By using the Spark Jobs API:

    curl -k -X DELETE <V4_JOBS_API_ENDPOINT>/<job_id> -H "Authorization: Bearer <ACCESS_TOKEN>"
    

    The call returns 204 No Content if the job is deleted successfully (a programmatic equivalent is sketched after this list).

    Note:

    If a running job is stopped, the Spark application appears under the Incomplete Applications tab in the Spark history server.

  • In the jobs UI if the Spark advanced features are enabled. See Using advanced features:

    • From the Navigation menu on the IBM Cloud Pak for Data web user interface, click Services > Instances, find the instance and click it to view the instance details.
    • Click the open and close list of options icon on the right of the instance details page and select Deployment Space to open the deployment space on the Jobs tab where you can view the Spark jobs.
    • Find your job, click it to open the job's details page and cancel the job run.
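
A programmatic equivalent of the DELETE call, again only a sketch that assumes the requests library and the same placeholders as the curl examples, checks for the 204 status code mentioned above:

import requests

JOBS_API_ENDPOINT = "<V4_JOBS_API_ENDPOINT>"
ACCESS_TOKEN = "<ACCESS_TOKEN>"
JOB_ID = "<job_id>"

response = requests.delete(
    JOBS_API_ENDPOINT + "/" + JOB_ID,
    headers={"Authorization": "Bearer " + ACCESS_TOKEN},
    verify=False,  # equivalent of curl -k
)
if response.status_code == 204:
    print("Job was stopped successfully.")
else:
    print("Unexpected response:", response.status_code, response.text)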

Learn more

Parent topic: Getting started with Spark applications