Submitting Spark jobs via API

You can run Spark jobs or applications on your IBM Cloud Pak for Data cluster without installing Watson Studio by using the Spark jobs REST API of Analytics Engine powered by Apache Spark.

This service is not available by default. An administrator must install this service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

Submitting Spark jobs

You can use the Spark jobs REST API to submit any Spark application that runs Spark SQL, data transformation, data science, or machine learning workloads. Each submitted job runs in a dedicated cluster, and any configuration settings that you pass through the jobs API override the default configuration.

To submit a Spark job:

  1. Get the service endpoint for the Spark jobs API:

    1. From the Navigation menu on the IBM Cloud Pak for Data web user interface, click Services > Instances, find the instance and click it to view the instance details.
    2. Under Access information, copy and save the Spark jobs endpoint.
  2. Generate a token if you haven't already done so. See Generate an access token; a sample token request is also sketched after the examples below.
  3. Submit the job using the endpoint and access token that you generated. The job that you want to submit can be located in the file storage system on the IBM Cloud Pak for Data cluster or in IBM Cloud Object Storage. See Storage considerations.

    The following examples show how to submit a word count application to the Spark instance. See Spark jobs API syntax for a description of the submit Spark jobs syntax, the parameters you can use and the returned error codes.

    The first example includes only the required parameters:

     curl -k -X POST <JOB_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>" -d '{"engine":{"type":"spark"},"application_arguments":["/opt/ibm/spark/examples/src/main/resources/people.txt"],"application": "/opt/ibm/spark/examples/src/main/python/wordcount.py"}' 
    

    The second example builds on the first and shows how to customize the cluster hardware size:

     curl -k -X POST <JOB_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>" -d '{"engine":{"type":"spark","template_id":"spark-2.4.0-jaas-v2-cp4d-template","conf":{"spark.app.name":"myJob"},"size":{"num_workers":"1","worker_size":{"cpu":1,"memory":"1g"},"driver_size":{"cpu":1,"memory":"1g"}}},"application_arguments":["1"],"application_jar":"/opt/ibm/spark/examples/jars/spark-examples*.jar","main_class":"org.apache.spark.examples.SparkPi"}' 
    
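To reuse the access token across these requests, you can store it in a shell variable. The following sketch assumes the Cloud Pak for Data /icp4d-api/v1/authorize endpoint, a username and password credential pair, and the jq tool for parsing the response; the exact request and response fields can vary by release, so follow Generate an access token for your environment.

    # Request an access token from the platform (the endpoint, payload, and the
    # .token response field are assumptions; see Generate an access token).
    ACCESS_TOKEN=$(curl -k -X POST https://<CLOUD_PAK_FOR_DATA_HOST>/icp4d-api/v1/authorize \
      -H "Content-Type: application/json" \
      -d '{"username":"<USERNAME>","password":"<PASSWORD>"}' | jq -r '.token')

    # Reuse the token when you submit the word count job.
    curl -k -X POST <JOB_API_ENDPOINT> -H "Authorization: Bearer $ACCESS_TOKEN" \
      -d '{"engine":{"type":"spark"},"application":"/opt/ibm/spark/examples/src/main/python/wordcount.py","application_arguments":["/opt/ibm/spark/examples/src/main/resources/people.txt"]}'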

Viewing Spark job status

After you have submitted your Spark job, you can view the job details, including the run status, by using the Spark jobs REST API.
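For example, assuming that the submit request returns a job ID, you can query the Spark jobs endpoint as shown in the following sketch. The /<JOB_ID> path segment and the response fields are assumptions; see the Spark jobs API syntax for the exact paths and fields in your release.

    # List all submitted jobs on the instance.
    curl -k -X GET <JOB_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>"

    # Query a single job by the ID returned from the submit request
    # (the /<JOB_ID> path segment is an assumption; check the Spark jobs API syntax).
    curl -k -X GET <JOB_API_ENDPOINT>/<JOB_ID> -H "Authorization: Bearer <ACCESS_TOKEN>"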

Deleting Spark jobs

You can delete a Spark job that you no longer need by using the Spark jobs REST API.
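For example, a delete request against the Spark jobs endpoint might look like the following sketch; the /<JOB_ID> path segment is again an assumption, so confirm it in the Spark jobs API syntax.

    # Delete a job that is no longer needed.
    curl -k -X DELETE <JOB_API_ENDPOINT>/<JOB_ID> -H "Authorization: Bearer <ACCESS_TOKEN>"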

Learn more