watsonx.data Spark user journey

This topic guides you through the end-to-end workflow of working with the Spark engine, from creating and submitting an application to monitoring its status and stopping it.

Before you begin

Install an IBM watsonx.data instance.
Provision a watsonx.data Spark engine inside the watsonx.data instance.

Required permissions: You must have the User role.

Creating storage

Create a storage volume to store the Spark application and related output.

Option 1: Create a storage volume in IBM Software Hub. For instructions, see Creating a storage volume.
Option 2: Create a Cloud Object Storage instance and a bucket. For instructions, see Creating a storage bucket.
If you use Cloud Object Storage, register the bucket in watsonx.data. To register a Cloud Object Storage bucket, see Adding storage.

Uploading the Spark application to storage

Upload the Spark application to the storage volume.
- If you use an IBM Software Hub storage volume, see Creating a storage volume.
- If you use Cloud Object Storage, see Uploading data.

Running the Spark application

If your Spark application resides in an IBM Software Hub storage volume, specify the parameter values and run the following curl command to submit the application.

curl --request POST \
  --url https://<cpd_host_name>/lakehouse/api/<api_version>/spark_engines/<spark_engine_id>/applications \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --header 'LhInstanceId: <instance_id>' \
  --data '{
  "application_details": {
    "application": "/myapp/<python file name>"
  },
  "volumes": [
    {
      "name": "cpd-instance::my-vol-1",
      "mount_path": "/myapp"
    }
  ]
}'

Parameter values:

<cpd_host_name>: The hostname of your IBM Software Hub.
<api_version>: When using the v2 API, set the <api_version> parameter to v2; for the v3 API, set it to v3.
<spark_engine_id>: The Engine ID of the native Spark engine.
<token>: The bearer token. For more information about generating the token, see Generating a bearer token.
<instance_id>: The instance ID from the watsonx.data cluster instance URL. For example, 1609968577169454.
<python file name>: The Spark application file name. The file must be available in the storage volume.
<my-vol-1>: The display name of the storage volume.
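
The submission request above can also be assembled from variables, which helps when you script repeated submissions. The following is a minimal sketch, not an official client: the function names (submit_url, submit_payload, submit_app) and the environment variable names (CPD_HOST, API_VERSION, SPARK_ENGINE_ID, TOKEN, INSTANCE_ID) are hypothetical, and the cpd-instance:: prefix is copied as-is from the example request.

```shell
# Hypothetical helper: print the submission endpoint.
# $1 = IBM Software Hub hostname, $2 = API version (v2 or v3), $3 = Spark engine ID
submit_url() {
  printf 'https://%s/lakehouse/api/%s/spark_engines/%s/applications' "$1" "$2" "$3"
}

# Hypothetical helper: print the JSON request body.
# $1 = mount path, $2 = application file name, $3 = volume display name
# Note: the values are inserted verbatim and are not JSON-escaped.
submit_payload() {
  printf '{"application_details":{"application":"%s/%s"},"volumes":[{"name":"cpd-instance::%s","mount_path":"%s"}]}' \
    "$1" "$2" "$3" "$1"
}

# Hypothetical wrapper around the curl command shown above.
# Reads CPD_HOST, API_VERSION, SPARK_ENGINE_ID, TOKEN, and INSTANCE_ID
# from the environment; takes the same three arguments as submit_payload.
submit_app() {
  curl --request POST \
    --url "$(submit_url "$CPD_HOST" "$API_VERSION" "$SPARK_ENGINE_ID")" \
    --header "Authorization: Bearer $TOKEN" \
    --header 'Content-Type: application/json' \
    --header "LhInstanceId: $INSTANCE_ID" \
    --data "$(submit_payload "$1" "$2" "$3")"
}
```

With the variables exported, a call such as submit_app /myapp my_app.py my-vol-1 would send the same request as the command above. Printing the output of submit_payload first is a cheap way to check the body before anything is sent.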

Checking the Spark application status

To view details of the Spark applications that are submitted to watsonx.data, log in to the watsonx.data cluster, go to the Infrastructure manager page, and do the following:

In the Applications tab, you can view the list of all applications that are submitted to watsonx.data. The tab also displays details such as the application status, Spark version, creation time, start time, and finish time.
Note: An application can have one of the following statuses:
- ACCEPTED: The application is waiting for the allocation of cluster resources.
- RUNNING: The application is executing its tasks on the cluster. The driver is managing the execution, and executors are processing data.
- WAITING: The application is waiting for cluster resources to be allocated, typically because the cluster does not have enough memory or cores to run it.
- FINISHED: The application has completed all its tasks and terminated successfully.
- FAILED: The application encountered an issue that caused it to stop running.
- ERROR: The application ran into a problem, possibly before execution began, such as a misconfiguration, driver or executor errors, missing dependencies, or a runtime issue that prevented it from starting.
- SUBMITTED: The application has just been submitted.
- STOPPED: The application was actively running but was stopped, either intentionally or due to external factors.
Click the arrow to the left of an application ID in the list to view more details, such as the Spark application ID and the application name.
Note: You can also filter the applications based on status using the Filter icon.
To view the details of a particular Spark application by using the API, provide the parameter values and run the following curl command:
```
curl -X GET -H "content-type: application/json" -H "AuthInstanceId: {instance_id}" "https://{cpd-host}.cp.fyre.ibm.com/lakehouse/api/<api_version>/spark_engines/{engine_id}/applications/{application_id}"
```
Parameter values:
- {instance_id}: The instance ID from the watsonx.data cluster instance URL. For example, 1609968577169454.
- {cpd-host}: The hostname of your IBM Software Hub.
- <api_version>: When using the v2 API, set the <api_version> parameter to v2; for the v3 API, set it to v3.
- {engine_id}: The Engine ID of the native Spark engine.
- {application_id}: The ID of the application whose details you want to view.
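
If you script the status check, you typically repeat it until the application stops changing state. The following is a minimal polling sketch under stated assumptions: app_status_url and wait_for_app are hypothetical helper names, the host argument is the full hostname, and FINISHED, FAILED, ERROR, and STOPPED are treated as final based on the status descriptions earlier in this topic. This topic does not show the response body, so the sketch does not parse it; the caller supplies a command that prints the current status.

```shell
# Hypothetical helper: print the status endpoint for one application.
# $1 = hostname, $2 = API version, $3 = engine ID, $4 = application ID
app_status_url() {
  printf 'https://%s/lakehouse/api/%s/spark_engines/%s/applications/%s' "$1" "$2" "$3" "$4"
}

# Hypothetical helper: poll until the application reaches a final status.
# $1 = name of a command or function that prints the current status on stdout
# $2 = seconds between polls (default 10)
# $3 = maximum number of polls (default 60)
# Prints the final status and returns 0; returns 1 if the polls run out,
# and 2 if the status command itself fails.
wait_for_app() {
  wfa_fetch=$1
  wfa_interval=${2:-10}
  wfa_max=${3:-60}
  wfa_attempt=0
  while [ "$wfa_attempt" -lt "$wfa_max" ]; do
    wfa_status=$("$wfa_fetch") || return 2
    case "$wfa_status" in
      # Assumption: these four statuses are final.
      FINISHED|FAILED|ERROR|STOPPED)
        printf '%s\n' "$wfa_status"
        return 0
        ;;
    esac
    wfa_attempt=$((wfa_attempt + 1))
    sleep "$wfa_interval"
  done
  return 1
}
```

Passing the status command as an argument keeps the loop independent of the response format: you can wrap the GET request and whatever extraction step fits the response you receive in a small function, and pass that function's name to wait_for_app.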

Stopping the Spark application

To stop applications in watsonx.data, do the following:

Important: You can stop only the applications that are in the RUNNING state.

In the Applications tab, select the application that you want to stop.
Click the overflow menu and select Stop. The application status changes to STOPPED.
To stop an application by using the API, provide the parameter values and run the following curl command:
```
curl -X DELETE -H "accept: */*" -H "AuthInstanceId: {instance_id}" "https://{cpd-host}.cp.fyre.ibm.com/lakehouse/api/<api_version>/spark_engines/{engine_id}/applications?application_id={application_id}"
```
Parameter values:
- {instance_id}: The instance ID from the watsonx.data cluster instance URL. For example, 1609968577169454.
- {cpd-host}: The hostname of your IBM Software Hub.
- <api_version>: When using the v2 API, set the <api_version> parameter to v2; for the v3 API, set it to v3.
- {engine_id}: The Engine ID of the native Spark engine.
- {application_id}: The ID of the application that you want to stop.
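
Because only applications in the RUNNING state can be stopped, a script might check the status before it sends the DELETE request. The following is a minimal sketch of that guard: can_stop, stop_url, and stop_app are hypothetical helper names, the host argument is the full hostname, and the headers simply mirror the command above.

```shell
# Hypothetical helper: succeed only for the status that this topic says can be stopped.
can_stop() {
  [ "$1" = "RUNNING" ]
}

# Hypothetical helper: print the stop endpoint.
# $1 = hostname, $2 = API version, $3 = engine ID, $4 = application ID
stop_url() {
  printf 'https://%s/lakehouse/api/%s/spark_engines/%s/applications?application_id=%s' "$1" "$2" "$3" "$4"
}

# Hypothetical wrapper: send the DELETE request only when the status allows it.
# $1-$4 = same as stop_url, $5 = instance ID, $6 = current application status
stop_app() {
  if ! can_stop "$6"; then
    printf 'Not stopping %s: status is %s, not RUNNING\n' "$4" "$6" >&2
    return 1
  fi
  curl -X DELETE -H "accept: */*" -H "AuthInstanceId: $5" "$(stop_url "$1" "$2" "$3" "$4")"
}
```

The guard runs on the client side, so the status can still change between the check and the request. Treat it as a way to avoid obviously unnecessary calls, not as a guarantee.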