Submitting a Spark application by using the REST API

You can submit a Spark application by running a curl command. Complete the following steps to submit a Python application.

watsonx.data on IBM Software Hub

Procedure

  1. Create a storage volume to store the Spark application and related output.
  2. If you use Cloud Object Storage, register the Cloud Object Storage bucket in watsonx.data. To register a Cloud Object Storage bucket, see Adding storage.
  3. Upload the Spark application to the storage volume.
  4. If your Spark application resides in an IBM Software Hub storage volume, specify the parameter values and run the following curl command to submit the application.
    curl --request POST \
      --url https://<cpd_host_name>/lakehouse/api/v2/spark_engines/<spark_engine_id>/applications \
      --header 'Authorization: Bearer <token>' \
      --header 'Content-Type: application/json' \
      --header 'LhInstanceId: <instance_id>' \
      --data '{
      "application_details": {
        "application": "/myapp/<python file name>"
      },
      "volumes": [
        {
          "name": "cpd-instance::my-vol-1",
          "mount_path": "/myapp"
        }
      ]
    }'
    Parameter values:
    • <cpd_host_name>: The hostname of your IBM Software Hub.
    • <spark_engine_id> : The Engine ID of the native Spark engine.
    • <token> : The bearer token. For more information about generating the token, see Generating a bearer token.
    • <instance_id> : The instance ID from the watsonx.data cluster instance URL. For example, 1609968577169454.
    • <python file name> : The Spark application file name. It must be available in the storage volume.
    • <my-vol-1> : The display name of the storage volume.
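    The application file itself can be any self-contained PySpark script. The following is a minimal sketch of such a script; the application logic and sample data are placeholders for illustration only and are not part of the API.

    from pyspark.sql import SparkSession

    # Minimal PySpark application (illustrative). Upload this file to the
    # storage volume, for example as /myapp/<python file name>.
    spark = SparkSession.builder.appName("hello-wxd").getOrCreate()

    # Create a small DataFrame and print it, just to verify the submission.
    df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
    df.show()

    spark.stop()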
  5. Submit an application that accesses the watsonx.data catalog.

    Use the following command to access data from a catalog that is associated with the Spark engine and to perform some basic operations on that catalog.

    Example:
    curl --request POST \
      --url https://<cpd_host_name>/lakehouse/api/v2/spark_engines/<spark_engine_id>/applications \
      --header 'Authorization: Bearer <token>' \
      --header 'Content-Type: application/json' \
      --header 'LhInstanceId: <instance_id>' \
      --data '{
      "application_details": {
        "application": "s3a://<application-bucket-name>/iceberg.py",
        "conf": {
            "spark.hadoop.wxd.apiKey":"ZenApiKey <encoded key>",
            "spark.app.name": "reader-app"     
        }
      }
    }'
    Parameter values:
    • <cpd_host_name>: The hostname of your IBM Software Hub.
    • <spark_engine_id> : The Engine ID of the native Spark engine.
    • <token> : The bearer token. For more information about generating the token, see Generating a bearer token.
    • <instance_id> : The instance ID from the watsonx.data cluster instance URL. For example, 1609968977179454.
    • <user-authentication-string> : The value must be in the format: echo -n "<username>:<your Zen API key>" | base64. The Zen API key here is the API key of the user who accesses the object store bucket. To generate an API key, log in to the watsonx.data console, navigate to Profile > Profile and Settings > API Keys, and generate a new API key.
      Note: If you generate a new API key, your old API key becomes invalid.
    • <application-bucket-name> : The name of the Cloud Object Storage bucket that contains the Spark application file.
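    For reference, the following is a minimal sketch of what an application such as iceberg.py might contain; the catalog name (lakehouse), the schema name, and the table name are assumptions for illustration, so substitute the names from your own watsonx.data instance.

    from pyspark.sql import SparkSession

    # Illustrative sketch: basic operations on a watsonx.data catalog that
    # is associated with the Spark engine. All object names are placeholders.
    spark = SparkSession.builder.appName("reader-app").getOrCreate()

    spark.sql("CREATE SCHEMA IF NOT EXISTS lakehouse.demo_schema")
    spark.sql(
        "CREATE TABLE IF NOT EXISTS lakehouse.demo_schema.demo_table "
        "(id INT, label STRING) USING iceberg"
    )
    spark.sql("INSERT INTO lakehouse.demo_schema.demo_table VALUES (1, 'alpha')")
    spark.sql("SELECT * FROM lakehouse.demo_schema.demo_table").show()

    spark.stop()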
  6. If your Spark application resides in ADLS (Gen1 or Gen2) and you want to submit the application by using DAS, specify the parameter values and run the following curl command. The following example shows the command to submit the adls-read.py application.
    Example:
    curl --request POST \
      --url https://<cpd_host_name>/lakehouse/api/v2/spark_engines/<spark_engine_id>/applications \
      --header 'Authorization: Bearer <token>' \
      --header 'Content-Type: application/json' \
      --header 'LhInstanceId: <instance_id>' \
      --data '{
      "application_details": {
        "application": "abfss://<storage_account>@<storage_container>.dfs.core.windows.net/adls-read.py",
        "conf": {
            "spark.hadoop.wxd.apikey":<token>,
            "spark.app.name": "reader-app"
        }
      }
    }'
    Parameter values:
    • <cpd_host_name>: The hostname of your IBM Software Hub.
    • <spark_engine_id> : The Engine ID of the native Spark engine.
    • <token> : The bearer token. For more information about generating the token, see Generating a bearer token.
    • <instance_id> : The instance ID from the watsonx.data cluster instance URL. For example, 1609968977179454.
    • <storage_account> : The name of the Azure storage account.
    • <storage_container> : The name of the Azure storage container.
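    For reference, the following is a minimal sketch of what adls-read.py might look like; the file path and the CSV options are placeholders for illustration.

    from pyspark.sql import SparkSession

    # Illustrative sketch: read a CSV file from ADLS through the abfss://
    # scheme. The container, account, and file names are placeholders.
    spark = SparkSession.builder.appName("reader-app").getOrCreate()

    df = spark.read.option("header", "true").csv(
        "abfss://<storage_container>@<storage_account>.dfs.core.windows.net/data/input.csv"
    )
    print(df.count())

    spark.stop()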
  7. If your Spark application resides in Google Cloud Storage and you want to submit the application by using DAS, specify the parameter values and run the following curl command. The following example shows the command to submit the gcs-read.py application.
    Example:
    curl --request POST \
      --url https://<cpd_host_name>/lakehouse/api/v2/spark_engines/<spark_engine_id>/applications \
      --header 'Authorization: Bearer <token>' \
      --header 'Content-Type: application/json' \
      --header 'LhInstanceId: <instance_id>' \
      --data '{
      "application_details": {
        "application": "gs://<application-bucket-name>//gcs-read.py",
        "conf": {
            "spark.hadoop.wxd.apikey":<token>,
            "spark.app.name": "reader-app"
        }
      }
    }'
    Parameter values:
    • <cpd_host_name>: The hostname of your IBM Software Hub.
    • <spark_engine_id> : The Engine ID of the native Spark engine.
    • <token> : The bearer token. For more information about generating the token, see Generating a bearer token.
    • <instance_id> : The instance ID from the watsonx.data cluster instance URL. For example, 1609968977179454.
    • <application-bucket-name> : The name of the Google Cloud Storage bucket that contains the Spark application file.
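    For reference, the following is a minimal sketch of what gcs-read.py might look like; the file path and the CSV options are placeholders for illustration.

    from pyspark.sql import SparkSession

    # Illustrative sketch: read a CSV file from a Google Cloud Storage bucket
    # through the gs:// scheme. The bucket and file names are placeholders.
    spark = SparkSession.builder.appName("reader-app").getOrCreate()

    df = spark.read.option("header", "true").csv(
        "gs://<application-bucket-name>/data/input.csv"
    )
    print(df.count())

    spark.stop()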
  8. After you submit the Spark application, you receive a confirmation message with the application ID and Spark version. Save the application ID for reference.
  9. Log in to the watsonx.data cluster and access the Engine details page. In the Applications tab, use the application ID to list the application and track its stages. For more information, see View and manage applications.
    Note: If the JSON payload includes an error when you submit a Spark application by using the API, the job fails without generating logs. To troubleshoot, see the Troubleshooting section. You can validate the payload locally before you submit it, as shown in the sketch that follows.
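    A minimal sketch of such a check, assuming the request body is saved in a local file named payload.json (a hypothetical file name):

    import json

    # Validate the request body before submitting it with curl; a malformed
    # payload causes the job to fail without generating logs.
    with open("payload.json") as f:
        try:
            json.load(f)
            print("payload.json is valid JSON")
        except json.JSONDecodeError as err:
            print(f"Invalid JSON: {err}")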