Creating unstructured data flows

Use the intuitive drag-and-drop interface to build reusable end-to-end pipelines that process and transform unstructured data.

Prerequisites

Unstructured Data Integration is part of the watsonx.data and Data Fabric experiences. You need a project set up in watsonx.data before you can start working with Unstructured Data Integration assets.

Review the following prerequisites for working with unstructured data.

The following nodes in the flow require watsonx.data intelligence:

  • Ingest > From Document Sets
  • Quality > Terms and Classifications assignment
  • Quality > Data class assignment
  • Generate Output > Document set

The following nodes in the flow require watsonx.ai:

  • Quality > PII and HAP annotator

The following nodes in the flow require watsonx.ai Runtime with Watson Document Understanding models deployed:

  • Transform data > Embeddings
  • Extract data

If these components are not available, the nodes might fail at design time or at run time.

Working with the canvas

You use the graphical canvas to build your flow. A flow is a pipeline of steps for processing your data, and each step in the flow is an operator node. Each operator performs a different task on documents, for example data ingestion, data annotation, or generating vector embeddings. An operator consumes the representation of a document and produces additional data about that document. On the left side of the canvas, use the nodes palette to drag and drop operators onto the canvas and organize them into a sequence. Some nodes are mandatory, and some must be used in a specific order. See Data preparation nodes for more information.

For each node, you can view details and set parameters in the panel that opens on the right side when you double-click the node.

The properties panel for each node shows the input features, which are the accumulated features added by all previous operators, sorted in descending order, and an editable list of output features in which you can choose the features to pass to the next node.
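The feature-accumulation behavior described above can be illustrated with a conceptual sketch. This is not the watsonx API; the operator names and document structure are invented for illustration only.

```python
# Conceptual sketch only -- not the watsonx API. Each operator consumes
# a document's representation and adds new features, so the input
# features of a node are the accumulation of everything produced by the
# operators before it.

def ingest(doc):
    doc["features"]["text"] = "Sample contract text"
    return doc

def annotate_pii(doc):
    doc["features"]["pii_found"] = False
    return doc

def embed(doc):
    doc["features"]["embedding"] = [0.1, 0.2, 0.3]
    return doc

# The flow is a sequence of operators, like nodes on the canvas.
pipeline = [ingest, annotate_pii, embed]

doc = {"id": "doc-1", "features": {}}
for operator in pipeline:
    doc = operator(doc)

# A node placed after `embed` would see all accumulated features:
print(sorted(doc["features"]))  # ['embedding', 'pii_found', 'text']
```

In the real canvas, the editable output-feature list corresponds to filtering this accumulated set before it reaches the next node.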

In addition, selected properties can be parameterized, as described in Working with parameters for node properties.

Adding a flow asset to your project

Follow these steps to add a flow to your project:

  1. Open a project.
  2. Click New asset.
  3. Click Create an unstructured data flow.
  4. Enter a name and an optional description.
  5. Click Create to open the canvas.

Building a flow

Follow these high-level steps to build a flow.

  1. Double-click a node in the palette or drag it onto the canvas. For example, drag a Data assets node onto the canvas.
  2. Configure a node as required. Double-click a node on the canvas to open the properties configuration panel on the right.
  3. To connect the nodes in order, drag the chevron icon from one node to the next.
  4. Save the flow.

Running a flow

Running a flow requires some configuration, which is called a job. A job can be created automatically with default settings, or you can create it yourself before running the flow.

When you create a job, you define its properties, such as the name, definition, runtime environment, schedule, and notifications. You can run a job immediately or wait for it to run at the next scheduled interval.

When you click Run flow:

  • If no job was created for this flow, a job with default settings (default name, no schedule, no notifications) is created and executed.
  • If a job was previously created by using Run flow or Create job, that job is run again.

Each run of the flow is called a job run.
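The Run flow behavior above can be sketched as a small state machine. This is a hypothetical illustration of the described semantics, not watsonx code; the job fields shown are assumptions based on the defaults mentioned in the text.

```python
# Hypothetical sketch of the "Run flow" semantics described above:
# the first click creates a job with default settings, subsequent
# clicks reuse and re-run that same job.

jobs = {}  # one job per flow

def run_flow(flow_id):
    job = jobs.get(flow_id)
    if job is None:
        # No job exists yet: create one with default settings.
        job = {
            "name": f"{flow_id} job",   # default name
            "schedule": None,            # no schedule
            "notifications": False,      # notifications off
            "runs": 0,
        }
        jobs[flow_id] = job
    # Every invocation produces a new job run.
    job["runs"] += 1
    return job

first = run_flow("my-flow")
second = run_flow("my-flow")
print(second["runs"])  # 2 -- same job, two job runs
```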

After you create a flow, you might want to first run it with small test loads by using the Python runtime to validate that the flow works as expected. Later, you can use Spark for large production loads.

With a flow open, click Run flow on the toolbar to run it immediately with default settings. By default, a flow runs with the Python runtime, which uses the container file system as storage. When you click Edit job, you can select the environment, a service instance, and storage volumes.

If you want to configure the automated job for running the flow, click Create job:

  1. Enter a name for the job and, optionally, a description.
  2. Select a runtime.
  3. Configure a schedule: Select when you want to run the job. You can also select to run the job only once, after it is created.
  4. Configure notifications for the job to get alerted about job success, warnings, and failures. By default, the notifications are off.
  5. Click Create or run.

Each time a job is started, a job run is created, which you can monitor and use to compare with the job run history of previous runs. Open the Jobs tab in your project to view detailed information about each job run, job state changes, and job failures in the job run log. To review logs from each execution, click the Start time entry for the execution.

Debugging a flow

When running a flow, you can monitor the progress of each node through its status icon, or view log details and a node run summary in a dedicated panel for each node to debug problems more easily:

  • Use the Log details tab to view and download the log.
  • Use the Node summary tab to review the status of a node's run, how many documents were processed, and other details. You can also click View table to see the features generated by the flow up to the selected node, presented in a table.

The table preview feature might slow down performance. To switch the feature off:

  1. In the flow canvas, click the Flow properties icon on the toolbar.
  2. Deselect Enable node output preview for the flow.
  3. Click Save.

Promoting a flow to a deployment space

When your flow is ready for testing or production, you can deploy it to a space. You can propagate changes from project to project, but then users can modify assets in each environment. To limit permissions further, propagate from a project to a space, where users can only update connections, update parameter sets, and run jobs. Use deployment spaces for testing or production to maintain a strict separation from the development environment.

Promoting a version of an asset to a space creates a new asset in the space, with a new asset ID. Promoting an asset also promotes its dependent assets, so associated connections, data assets, and parameter sets are available together with the flow. Users can run the flow and change parameters, but they can't modify the flow itself.

To promote a flow to a space:

  1. In your project, go to the Assets tab.
  2. Select the flow you want to promote and click Promote to space.
  3. Select the space or create a new one.
  4. Fill in other required details and click Promote. Do not close the window until all assets are promoted.
Note: Document libraries can't be promoted from a project to a space.

To work around this limitation:

  1. When designing the flow in a project, create a local parameter or a parameter set for the document library ID.
  2. Assign this parameter in the property panel of the document set operator instead of directly entering the document library ID.
  3. Promote the flow to a space when ready.
  4. Create the document library in the space.
  5. When running the flow in the space, pass the document library ID as a parameter or in a parameter set.
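The workaround can be sketched conceptually as follows. This is a hypothetical illustration, not watsonx code: the `${...}` reference syntax, the node configuration shape, and the IDs are invented to show why a parameter reference survives promotion while a hard-coded library ID does not.

```python
# Hypothetical sketch: the document set operator stores a parameter
# reference instead of a hard-coded document library ID, so the value
# can be supplied at run time in the space.

def resolve_document_library(node_config, run_parameters):
    """Return the document library ID for a node.

    If the node stores a parameter reference (here written as
    "${name}"), look the value up in the run-time parameters;
    otherwise return the directly entered ID, which would only
    resolve in the original project.
    """
    ref = node_config["document_library"]
    if ref.startswith("${") and ref.endswith("}"):
        return run_parameters[ref[2:-1]]
    return ref

# In the project, the flow is designed with a parameter reference:
node = {"document_library": "${doc_lib_id}"}

# In the space, the job run supplies the ID of the library
# that was created there (step 4 above):
params = {"doc_lib_id": "lib-created-in-space"}

print(resolve_document_library(node, params))  # lib-created-in-space
```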