Running a flow with Spark
You can use the Spark orchestrator to run flows with large workloads that need more compute resources.
Prerequisites
- watsonx.ai
- watsonx.data Spark
- Datasift API Version 1.0.1092 or above
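The minimum version in the prerequisites has a four-digit final component, so comparing version strings as plain text can give the wrong answer ("1.0.999" sorts after "1.0.1092" alphabetically). The following is a minimal sketch of a numeric comparison. It assumes the installed version is available to you as a dotted string; how you obtain that string is not covered here, and the function names are illustrative only.

```python
MINIMUM_VERSION = "1.0.1092"  # minimum listed in the prerequisites


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '1.0.1092' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed: str, minimum: str = MINIMUM_VERSION) -> bool:
    """Compare versions component by component, numerically."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    print(meets_minimum("1.0.1092"))  # True
    print(meets_minimum("1.0.999"))   # False
```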
Configuring the Lakehouse and Spark instances
- In the IBM watsonx context, verify that Lakehouse is available: from the main menu, select Services > Service instances and search for IBM watsonx.data Lakehouse.
- Create a watsonx.data Spark instance inside Lakehouse:
  - Open the Infrastructure Manager.
  - Select Add Component.
  - Select IBM Spark, then click Next.
  - In the Configurations tab, select Create a native Spark engine and specify version 3.5. Define the Engine home, which stores the logs, by selecting an existing volume or creating a new one.
  - Click Create.
Configuring Spark for the Unstructured Data Integration flow
- Open a project where you want to run the flow and go to the Manage tab.
- Select Environments and open the Templates tab.
- Click New template.
- Provide the following Spark environment details:
  - Type: Spark
  - Spark engine: Select the Lakehouse instance
  - Software version: Python 3.11 with watsonx.data Spark 3.5
- Click Create.
- Create a job with Spark to run the flow. In the Configuration step, provide the following information:
  - Runtime: Spark
  - Lakehouse instance: instance_name
  - Service instances: engine_name, for example, Spark35
  - Environments: environment_name
- Save the changes.
When you click Run Flow, the flow runs on the Spark cluster in a distributed fashion.
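Before you run the flow, it can help to confirm that none of the job settings still holds placeholder text. The sketch below is a hypothetical checklist and does not call any product API. The class SparkJobSettings, its field names, and the example value my-lakehouse are assumptions made for illustration; only the four setting labels and the Spark35 example come from the steps above.

```python
from dataclasses import dataclass, fields

# Placeholder text from the configuration steps that must be replaced.
PLACEHOLDERS = {"", "instance_name", "engine_name", "environment_name"}


@dataclass
class SparkJobSettings:
    """Hypothetical record of the values entered in the Configuration step."""
    runtime: str
    lakehouse_instance: str
    service_instance: str
    environment: str


def find_problems(settings: SparkJobSettings) -> list[str]:
    """Return a human-readable list of settings that still need attention."""
    issues = []
    if settings.runtime != "Spark":
        issues.append("runtime must be Spark")
    for field in fields(settings):
        if getattr(settings, field.name).strip() in PLACEHOLDERS:
            issues.append(f"{field.name} is not filled in")
    return issues


if __name__ == "__main__":
    draft = SparkJobSettings(
        runtime="Spark",
        lakehouse_instance="my-lakehouse",  # assumed example name
        service_instance="Spark35",
        environment="environment_name",     # placeholder left in by mistake
    )
    print(find_problems(draft))
```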
Accessing Spark logs
- Locate the volume name of the Spark instance. Open Infrastructure Manager, select your Spark instance, and check the associated volume name under Engine home volume.
- In the main menu, select Services > Instance. Select the instance from the previous step.
- Open the volume and find a folder named spark. Within that folder, find the Spark Application ID folder and open it.
- Inside the application ID folder, find the Spark Pod ID. Open the logs folder.
- Select the file with the name ending in stdout. It contains the Unstructured Data Integration logs.
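If you have a local copy of the Engine home volume (for example, downloaded or mounted on your machine), the folder layout described above can be searched with a short script. This is a minimal sketch that assumes the layout spark/<application ID>/<pod ID>/logs/ and a file name ending in stdout. The function name and the local path engine-home-volume are hypothetical.

```python
from pathlib import Path


def find_stdout_logs(volume_root: Path) -> list[Path]:
    """Return files ending in 'stdout' under spark/<app ID>/<pod ID>/logs/."""
    if not volume_root.is_dir():
        return []
    # One wildcard for the application ID folder, one for the pod ID folder.
    pattern = "spark/*/*/logs/*stdout"
    return sorted(p for p in volume_root.glob(pattern) if p.is_file())


if __name__ == "__main__":
    # Hypothetical local copy of the Engine home volume.
    for log_file in find_stdout_logs(Path("engine-home-volume")):
        print(log_file)
```

Each path that is printed includes the application ID and pod ID folders, so you can tell which run a log belongs to.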