Running a flow with Spark

Use the Spark orchestrator to run flows that process large workloads and therefore need more compute resources.

Prerequisites

  • watsonx.ai
  • watsonx.data Spark
  • Datasift API Version 1.0.1092 or above
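The version prerequisite is easy to misjudge by eye, so here is a minimal sketch of a numeric version comparison. It assumes you already know the installed Datasift API version as a dotted string; how to look that version up is not covered here, and the names `MINIMUM_VERSION`, `parse_version`, and `meets_minimum` are hypothetical helpers, not part of any product API.

```python
# Minimum Datasift API version taken from the prerequisites list.
MINIMUM_VERSION = (1, 0, 1092)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '1.0.1092' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed: str, minimum: tuple[int, ...] = MINIMUM_VERSION) -> bool:
    """Return True if the installed version is at or above the minimum."""
    # Tuples compare element by element, so 1092 is correctly treated as
    # greater than 999 (a plain string comparison would get this wrong).
    return parse_version(installed) >= minimum


if __name__ == "__main__":
    # Hypothetical version strings used only for illustration.
    for candidate in ("1.0.999", "1.0.1092", "1.1.0"):
        print(candidate, meets_minimum(candidate))
```

Comparing the parts as integers matters because the build number has four digits: as text, "1.0.999" sorts after "1.0.1092", even though it is the older version.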

Configuring the Lakehouse and Spark instances

  1. In the IBM watsonx context, verify that Lakehouse is available: from the main menu, select Services > Service instances and search for IBM watsonx.data Lakehouse.
  2. Create a watsonx.data Spark instance inside Lakehouse:
    1. Open the Infrastructure Manager.
    2. Select Add Component.
    3. Select IBM Spark, and then click Next.
    4. In the Configurations tab, select Create a native Spark engine and specify version 3.5. For Engine home, which stores the logs, either select an existing volume or create a new one.
    5. Click Create.

Configuring Spark for the Unstructured Data Integration flow

  1. Open a project where you want to run the flow and go to the Manage tab.
  2. Select Environments and open the Templates tab.
  3. Click New template.
  4. Provide the following Spark environment details:
    • Type: Spark
    • Spark engine: Select the lakehouse instance
    • Software version: Python 3.11 with watsonx.data Spark 3.5
  5. Click Create.
  6. Create a job with Spark to run the flow. In the Configuration step, provide the following information:
    • Runtime: Spark
    • Lakehouse instance: instance_name
    • Service instances: engine_name, for example, Spark35
    • Environments: environment_name
  7. Save the changes.

When you click Run Flow, the flow runs on the Spark cluster in a distributed fashion.
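The job configuration fields listed above can be kept as a small checklist before you fill in the form. The following is an illustrative sketch only: the job is configured in the user interface, and nothing here implies that the product accepts these settings as a file or through an API. The class `SparkJobSettings`, the function `missing_settings`, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class SparkJobSettings:
    """The four values entered in the job's Configuration step."""
    runtime: str             # Runtime, expected to be "Spark"
    lakehouse_instance: str  # Lakehouse instance (instance_name)
    service_instance: str    # Service instances (engine_name, for example Spark35)
    environment: str         # Environments (environment_name)


def missing_settings(settings: SparkJobSettings) -> list[str]:
    """Return the names of fields that are empty or contain only whitespace."""
    return [
        field.name
        for field in fields(settings)
        if not getattr(settings, field.name).strip()
    ]


if __name__ == "__main__":
    # Placeholder values; substitute the names from your own deployment.
    draft = SparkJobSettings(
        runtime="Spark",
        lakehouse_instance="",
        service_instance="Spark35",
        environment="my_spark_environment",
    )
    print(missing_settings(draft))
```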

Accessing Spark logs

  1. Locate the volume name of the Spark instance. Open Infrastructure Manager, select your Spark instance, and check the associated volume name under Engine home volume.
  2. From the main menu, select Services > Instance, and then select the volume instance that you identified in the previous step.
  3. Open the volume and find the folder named spark. Within that folder, find the folder named after the Spark application ID and open it.
  4. Inside the application ID folder, find the folder named after the Spark pod ID, and then open its logs folder.
  5. Select the file whose name ends with stdout. It contains the Unstructured Data Integration logs.
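The folder layout described in these steps can be sketched as a small search. This example assumes that the contents of the Engine home volume are available as a local directory (for example, after downloading them), which the steps above do not cover. The function name `find_stdout_logs` and the parameters `volume_root` and `application_id` are hypothetical.

```python
from pathlib import Path


def find_stdout_logs(volume_root: Path, application_id: str) -> list[Path]:
    """List stdout log files for one Spark application.

    Assumed layout, following the steps above:
    <volume_root>/spark/<application_id>/<pod_id>/logs/<name ending in stdout>
    """
    app_dir = volume_root / "spark" / application_id
    # The first "*" matches each Spark pod ID folder; "*stdout" matches
    # any file whose name ends with stdout inside that pod's logs folder.
    return sorted(
        path for path in app_dir.glob("*/logs/*stdout") if path.is_file()
    )


if __name__ == "__main__":
    # Hypothetical local path and application ID; replace with your own.
    for log_file in find_stdout_logs(Path("./engine_home"), "my-application-id"):
        print(log_file)
```

If the application ID folder does not exist, the search returns an empty list, which usually means the ID or the volume is not the one the job wrote to.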