watsonx.data Spark user journey
This topic guides you through the end-to-end workflow of working with the Spark engine, from creating an application to submitting it, monitoring its status, and stopping it.
Before you begin
- Install an IBM watsonx.data instance.
- Provision a watsonx.data Spark engine inside the watsonx.data instance.
- Required permissions
- You must have the User role.
Creating storage
Create a storage volume to store the Spark application and related output.
- Option 1: Create a storage volume in IBM Software Hub. To create a storage volume in IBM Software Hub, see Creating a storage volume.
- Option 2: Create Cloud Object Storage. To create Cloud Object Storage and a bucket, see Creating a storage bucket.
- If you use Cloud Object Storage, register the Cloud Object Storage bucket in watsonx.data. To register a Cloud Object Storage bucket, see Adding storage.
Uploading the Spark application to storage
- Upload the Spark application to the storage volume.
- If you use an IBM Software Hub storage volume, see Creating a storage volume.
- If you use Cloud Object Storage, see Uploading data.
Running the Spark application
If your Spark application resides in an IBM Software Hub storage volume, specify the parameter values and run the following curl command to submit the application.
curl --request POST \
  --url https://<cpd_host_name>/lakehouse/api/<api_version>/spark_engines/<spark_engine_id>/applications \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --header 'LhInstanceId: <instance_id>' \
  --data '{
  "application_details": {
    "application": "/myapp/<python file name>"
  },
  "volumes": [
    {
      "name": "cpd-instance::my-vol-1",
      "mount_path": "/myapp"
    }
  ]
}'
Parameter values:
- <cpd_host_name>: The hostname of your IBM Software Hub.
- <api_version>: When using the v2 API, set the <api_version> parameter to v2; for the v3 API, set it to v3.
- <spark_engine_id>: The Engine ID of the native Spark engine.
- <token>: The bearer token. For more information about generating the token, see Generating a bearer token.
- <instance_id>: The instance ID from the watsonx.data cluster instance URL. For example, 1609968577169454.
- <python file name>: The Spark application file name. The file must be available in the storage volume.
- <my-vol-1>: The display name of the storage volume.
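The same submission can be scripted. The following is a minimal Python sketch, not an official client: it only builds the request that the curl command above describes, using the standard library. The function name `build_submit_request` and its argument names are hypothetical; the URL pattern, headers, and payload fields are taken from the curl command.

```python
import json
import urllib.request


def build_submit_request(host, api_version, engine_id, token, instance_id,
                         volume_name, mount_path, app_file):
    """Build (but do not send) the POST request that submits a Spark application.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    url = (f"https://{host}/lakehouse/api/{api_version}"
           f"/spark_engines/{engine_id}/applications")
    payload = {
        "application_details": {
            # The application path is the volume mount path plus the file name,
            # as in "/myapp/<python file name>" in the curl command.
            "application": f"{mount_path.rstrip('/')}/{app_file}",
        },
        "volumes": [{"name": volume_name, "mount_path": mount_path}],
    }
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        "LhInstanceId": instance_id,
    }
    # urllib normalizes the casing of header names; HTTP header names are
    # case-insensitive, so this does not change the request's meaning.
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen`. Keeping request construction separate from sending makes it possible to inspect the URL and payload before anything reaches the cluster.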
Checking the Spark application status
Log in to the watsonx.data cluster. Go to the Infrastructure manager page. To view details of the submitted Spark applications in watsonx.data, do the following:
- In the Applications tab, you can view the list of all applications that are submitted to watsonx.data. The tab also displays details such as the application status, Spark version, creation time, start time, and finish time.
Note: An application can have one of the following statuses:
- ACCEPTED: The application is waiting for the allocation of cluster resources.
- RUNNING: The application is executing its tasks on the cluster. The driver is managing the execution, and executors are processing data.
- WAITING: The application is waiting for cluster resources to be allocated, typically because the cluster does not have enough resources (memory or cores) to run the application.
- FINISHED: The application has completed all its tasks and terminated successfully.
- FAILED: The application encountered an issue that caused it to stop running.
- ERROR: An application in the error state might have encountered issues before execution, such as misconfiguration, driver or executor errors, missing dependencies, or runtime issues that prevented it from starting.
- SUBMITTED: The application has just been submitted.
- STOPPED: The application was actively running but was stopped, either intentionally or due to external factors.
- Click the arrow to the left of an application ID in the result list to view more details, such as the Spark application ID and Application name.
Note: You can also filter the applications based on status by using the Filter icon.
- To view the details of a particular Spark application by using the API, provide the parameter values and run the following curl command:
curl -X GET -H "content-type: application/json" -H "AuthInstanceId: {instance_id}" "https://{cpd-host}.cp.fyre.ibm.com/lakehouse/api/<api_version>/spark_engines/{engine_id}/applications/{application_id}"
Parameter values:
- {instance_id}: The instance ID from the watsonx.data cluster instance URL. For example, 1609968577169454.
- {cpd-host}: The hostname of your IBM Software Hub.
- <api_version>: When using the v2 API, set the <api_version> parameter to v2; for the v3 API, set it to v3.
- {engine_id}: The Engine ID of the native Spark engine.
- {application_id}: The ID of the application whose details you want to view.
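Status checks are often repeated until the application stops changing state. The following Python sketch illustrates one way to do that. It rests on several assumptions: the helper names are hypothetical, `host` is the full hostname of the cluster, and FINISHED, FAILED, ERROR, and STOPPED are treated as final states based on the status descriptions above. Because the response format of the API is not described here, the sketch takes a caller-supplied `fetch_state` function that returns the current status string.

```python
import time

# Assumption: these statuses are final, based on the descriptions above.
TERMINAL_STATES = {"FINISHED", "FAILED", "ERROR", "STOPPED"}


def status_url(host, api_version, engine_id, application_id):
    """Return the application-details URL in the pattern used by the GET call."""
    return (f"https://{host}/lakehouse/api/{api_version}"
            f"/spark_engines/{engine_id}/applications/{application_id}")


def wait_for_terminal_state(fetch_state, poll_seconds=10.0, max_polls=60,
                            sleep=time.sleep):
    """Call fetch_state() until it returns a terminal state or polls run out.

    fetch_state is a caller-supplied function (hypothetical) that performs the
    GET request and returns the status string, for example "RUNNING".
    """
    state = None
    for attempt in range(max_polls):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        # Do not sleep after the final poll.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(
        f"application still in state {state!r} after {max_polls} polls")
```

Injecting `fetch_state` and `sleep` keeps the polling logic independent of the HTTP client and makes it easy to exercise without a cluster.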
Stopping the Spark application
To stop applications in watsonx.data, do the following:
Important: You can stop only applications that are in the RUNNING state.
- In the Applications tab, select the application that you want to stop.
- Click the overflow menu and select Stop. The application status changes to STOPPED.
- To stop an application by using the API, provide the parameter values and run the following curl command:
curl -X DELETE -H "accept: */*" -H "AuthInstanceId: {instance_id}" "https://{cpd-host}.cp.fyre.ibm.com/lakehouse/api/<api_version>/spark_engines/{engine_id}/applications?application_id={application_id}"
Parameter values:
- {instance_id}: The instance ID from the watsonx.data cluster instance URL. For example, 1609968577169454.
- {cpd-host}: The hostname of your IBM Software Hub.
- <api_version>: When using the v2 API, set the <api_version> parameter to v2; for the v3 API, set it to v3.
- {engine_id}: The Engine ID of the native Spark engine.
- {application_id}: The ID of the application that you want to stop.
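The stop call can be scripted in the same way. The following Python sketch is illustrative only: `build_stop_request` is a hypothetical helper, `host` is assumed to be the full hostname of the cluster, and the client-side check on `current_state` simply mirrors the rule that only RUNNING applications can be stopped. The headers match the curl command above, which shows no Authorization header; add one if your cluster requires it.

```python
import urllib.parse
import urllib.request


def build_stop_request(host, api_version, engine_id, instance_id,
                       application_id, current_state):
    """Build (but do not send) the DELETE request that stops an application.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    # Mirror the documented rule: only RUNNING applications can be stopped.
    if current_state != "RUNNING":
        raise ValueError(
            f"only RUNNING applications can be stopped, got {current_state!r}")
    # The application ID is passed as a query parameter, as in the curl command.
    query = urllib.parse.urlencode({"application_id": application_id})
    url = (f"https://{host}/lakehouse/api/{api_version}"
           f"/spark_engines/{engine_id}/applications?{query}")
    headers = {"accept": "*/*", "AuthInstanceId": instance_id}
    return urllib.request.Request(url, headers=headers, method="DELETE")
```

Passing the returned object to `urllib.request.urlopen` sends the request. Checking the state before building the request avoids a round trip for applications that cannot be stopped.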