Creating unstructured data flows

Use the intuitive drag-and-drop interface to build reusable end-to-end pipelines that process and transform unstructured data.

Prerequisites

Unstructured Data Integration is part of the watsonx.data and Data Fabric experience. You need a project set up in watsonx.data before you can start working with Unstructured Data Integration assets.

Review the following prerequisites for working with unstructured data.

The following nodes in the flow require watsonx.data intelligence:

  • Ingest > From Document Sets
  • Quality > Terms and Classifications assignment
  • Quality > Data class assignment
  • Generate Output > Document set

The following nodes in the flow require Machine Learning with Watson Document Understanding models deployed:

  • Transform data > Embeddings
  • Extract data

If these components are not available, the nodes might fail at design time or at run time.

Working with the canvas

You use the graphical canvas to build your flow. A flow is a pipeline of steps for processing your data, and each step in the flow consists of an operator node. Operators act on documents, and each operator performs a different task, such as data ingestion, data annotation, or generating vector embeddings. An operator consumes the representation of a document and produces additional data about that document. On the left side of the canvas, use the nodes palette to drag the operators onto the canvas and arrange them into a sequence. Some nodes are mandatory, and some must be used in a specific order. See Data preparation nodes for more information.
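Conceptually, a flow behaves like an ordered pipeline of operators, where each operator consumes a document representation and returns it enriched with additional data. The following sketch is purely illustrative; the operator names and document fields are hypothetical and are not the product's API:

```python
# Hypothetical sketch of a flow as an ordered pipeline of operators.
# Each operator takes a document representation (a dict here) and
# returns it enriched with additional data about the document.

def ingest(doc):
    # Ingestion: attach the raw text of the document (field names assumed).
    doc["text"] = "Example document contents."
    return doc

def annotate(doc):
    # Annotation: derive metadata from the ingested text.
    doc["word_count"] = len(doc["text"].split())
    return doc

def embed(doc):
    # Embedding: produce a vector representation (dummy values here).
    doc["embedding"] = [0.0] * 4
    return doc

def run_flow(doc, operators):
    # Apply each operator in sequence, like nodes connected on the canvas.
    for op in operators:
        doc = op(doc)
    return doc

result = run_flow({"id": "doc-1"}, [ingest, annotate, embed])
```

The ordering constraint mentioned above maps directly onto the list of operators: ingestion must come first because later operators depend on the data it attaches.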

For each node, you can view details and set parameters in the panel that opens on the right side when you double-click the node.

Adding a flow asset to your project

Follow these steps to add a flow to your project:

  1. Open a project.
  2. Click New asset.
  3. Click Create an unstructured data flow.
  4. Enter a name and an optional description.
  5. Click Create to open the canvas.

Building a flow

Follow these high-level steps to build a flow.

  1. Double-click a node in the palette, or drag it onto the canvas. For example, drag a Data assets node onto the canvas.
  2. Configure a node as required. Double-click a node on the canvas to open the properties configuration panel on the right.
  3. To connect and order the flow, drag the chevron icon from one node to another.
  4. Save the flow.

Running a flow

Running a flow requires some configuration, which is called a job. A job can be created automatically with default settings, or you can create it yourself before running the flow.

When you create a job, you define its properties, such as the name, definition, runtime environment, schedule, and notifications. You can run a job immediately or wait for it to run at the next scheduled interval.

When you click Run flow:

  • If no job was created for this flow, a job with default settings (default name, no schedule, no notifications) is created and executed.
  • If a job was previously created by using Run flow or Create job, that job is run again.

Each run of the flow is called a job run.
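The Run flow behavior described above can be sketched as a small piece of logic. This is an illustrative model only, not product code; the registry, field names, and defaults are assumptions based on the description:

```python
# Illustrative sketch (not product code) of the Run flow behavior:
# reuse the flow's job if one exists, otherwise create one with defaults.

jobs = {}         # hypothetical registry mapping flow name -> job settings
run_history = []  # each started job produces a job run entry

def click_run_flow(flow_name):
    if flow_name not in jobs:
        # No job yet: create one with default settings.
        jobs[flow_name] = {
            "name": f"{flow_name} job",  # default name
            "schedule": None,            # no schedule
            "notifications": False,      # no notifications
        }
    # Each start of the job creates a job run.
    run_history.append({"job": jobs[flow_name]["name"], "status": "started"})

click_run_flow("my flow")   # creates the default job and runs it
click_run_flow("my flow")   # runs the existing job again
```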

A flow runs on a runtime; you can choose the execution engine for each run, depending on your needs:

  • Select Python for flows with small loads that consume fewer resources.
  • Select Spark for flows with large loads that consume more resources.

After you create a flow, you might want to first run it with small test loads on Python to validate that the flow works as expected. Later, you can use Spark with large production loads.

With a flow open, click Run flow on the toolbar to run it immediately with default settings. By default, a flow runs on the Python runtime, which uses the container file system as storage. When you click Edit job, you can change the runtime to Spark and select the environment, a service instance, and storage volumes.

If you want to configure the automated job for running the flow, click Create job:

  1. Enter a name for the job and, optionally, a description.
  2. Select a runtime.
  3. Configure a schedule: select when you want to run the job. You can also choose to run the job only once, after it is created.
  4. Configure notifications for the job to be alerted about job successes, warnings, and failures. By default, notifications are off.
  5. Click Create or run.
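The job properties set in the steps above can be pictured as a single configuration record. The following sketch is hypothetical: the key names, cron-style schedule string, and validation helper are illustrative only, not the product's job schema:

```python
# Hypothetical job definition illustrating the properties set in Create job.
# Key names are illustrative only, not the product's job schema.

job = {
    "name": "nightly-ingest",
    "description": "Process new documents every night.",
    "runtime": "spark",            # or "python" for small test loads
    "schedule": "0 2 * * *",       # cron-style: run daily at 02:00
    "notifications": {             # off by default; enabled here
        "success": True,
        "warnings": True,
        "failures": True,
    },
}

def validate_job(job):
    # Minimal sanity checks on the illustrative schema above.
    assert job["runtime"] in ("python", "spark")
    assert isinstance(job["name"], str) and job["name"]
    return True
```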

Each time a job starts, a job run is created, which you can monitor and compare with previous runs in the job run history. Open the Jobs tab in your project to view detailed information about each job run, including job state changes and job failures, in the job run log. To review the logs for a run, click its Start time entry.