Known issues and limitations for watsonx.data integration

The following issues and limitations apply to watsonx.data™ integration.

Known issues for Unstructured Data Integration

In flows where a single document class is selected, classification and extraction might not work

Applies to: 5.4.0

In a flow where only one document class is provided for processing, the documents might not be properly processed.

In the Unstructured Data Integration flow, the Classification operator might fail to classify the documents, or the Extract operator might fail to extract any entities.

In unstructured data curation, the analysis flow might properly classify the documents. However, when you run the processing flow, the metrics might show that the documents were skipped for extraction or no entities were extracted.

Workaround: Manually update the generated flow:

  1. Replace the Classification operator with an Extract operator, and select all document classes.
  2. Remove any additional Extract operators that appear later in the flow.

Related assets not copied when copying flow from a catalog to a project

Applies to: 5.4.0

When you copy the Unstructured Data Integration flow from a catalog to a project using Add to project, related assets are not copied.

Workaround: Copy related assets manually and update the flow in the project. Alternatively, run:
curl --insecure -X POST "/udp/v1/flows/{flow_id}/deepcopy" -H "Authorization: Bearer <bearer_token>" -H "Content-Type: application/json" -d '{"container_kind": "catalog", "container_id": "<catalog_id>", "target_container_kind": "project", "target_container_id": "<project_id>"}'
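
The request body in the command above can be built from variables so that the same call is easier to repeat for several flows. The following is a minimal sketch, not an official tool: the function name build_deepcopy_payload is hypothetical, and BASE_URL, FLOW_ID, BEARER_TOKEN, CATALOG_ID, and PROJECT_ID are placeholders that you must supply. The source does not state the host for the request, so it is left as a variable.

```shell
#!/bin/sh
# Hypothetical helper: prints the JSON body for the deepcopy request.
# $1 = catalog ID, $2 = project ID.
# Assumption: the IDs contain no characters that need JSON escaping.
build_deepcopy_payload() {
  printf '{"container_kind": "catalog", "container_id": "%s", "target_container_kind": "project", "target_container_id": "%s"}' "$1" "$2"
}

# Example usage (not executed here; every variable is a placeholder):
# curl --insecure -X POST "${BASE_URL}/udp/v1/flows/${FLOW_ID}/deepcopy" \
#   -H "Authorization: Bearer ${BEARER_TOKEN}" \
#   -H "Content-Type: application/json" \
#   -d "$(build_deepcopy_payload "$CATALOG_ID" "$PROJECT_ID")"
```

Keeping the payload in one function means that only the two IDs change between calls.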

Flows using document libraries can't be promoted to space

Applies to: 5.4.0

A flow that uses a document library does not work after it is promoted from a project to a space, because the document library is not promoted along with the flow.

Workaround:
  1. When you design the flow in a project, create a local parameter or a parameter set for the document library ID.
  2. Assign this parameter in the property panel of the document set operator instead of entering the document library ID directly.
  3. Promote the flow to a space when it is ready.
  4. Create the document library in the space.
  5. When you run the flow in the space, pass the document library ID as a parameter or a parameter set.

Intermittent flow failures

Applies to: 5.4.0

Some operators that depend on other IBM services might fail intermittently with errors such as:
  • Reading binary data failed from connection (Extract operator)
  • No results from runtime (Extract operator)

Workaround: Rerun the flow with incremental processing so that the failed documents can be processed again. If the same document fails multiple times, there might be another reason for the error.
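
If you trigger flow runs from a script, the rerun advice above can be sketched as a capped retry loop. This is an illustration only: rerun_with_limit is a hypothetical helper, and the command that you pass to it (for example, a script of your own that starts the flow with incremental processing) is an assumption, because the source does not describe a command-line way to start a flow.

```shell
#!/bin/sh
# Hypothetical helper: runs a command up to a fixed number of times.
# Usage: rerun_with_limit <max_attempts> <command> [args...]
rerun_with_limit() {
  max_attempts=$1
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Stop at the first successful run.
    if "$@"; then
      echo "succeeded on attempt $attempt"
      return 0
    fi
    echo "attempt $attempt failed" >&2
    attempt=$((attempt + 1))
  done
  # Repeated failures suggest a cause other than an intermittent service error.
  echo "giving up after $max_attempts attempts; investigate the failing documents" >&2
  return 1
}

# Example usage (run_flow.sh is a placeholder for your own script):
# rerun_with_limit 3 ./run_flow.sh
```

Capping the attempts keeps a persistent failure from being retried indefinitely.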

Document Set and Entity Store operators using Python Orchestrator fail

Applies to: 5.4.0

The Document Set and Entity Store operators using Python Orchestrator might fail with the following error:
Node Document set failed and caused aborting the branch execution: Please check if the MinIO bucket associated with the catalog, and the service route has been created or not. 
Workaround: Ensure the following two prerequisites are met for these operators:
  • Access to the associated metadata store bucket

    Each metadata store (for example, Hive or SQL) is associated with an S3 or Cloud Object Storage bucket. The user executing the operators must have access permissions to this underlying bucket. If the access is not already granted, you must add the required user or group to the bucket using the watsonx Infrastructure Manager Console. Without bucket access, the operators are not able to read or write data to the metadata store.

  • Handling the default MinIO bucket (Non-production usage)
    For exploratory or non-production scenarios, watsonx.data includes a default MinIO bucket that is automatically associated with the metadata store. However, this default bucket uses an internal S3 endpoint that is not accessible from external systems such as Unstructured Data Integration. If you plan to use this default MinIO bucket, you must expose the endpoint externally so that it can be accessed by outside systems.
    Note: Creating the edge route exposes the MinIO console externally, allowing external clients to interact with it.
    Follow these steps to expose the MinIO bucket:
    1. Access the MinIO Console.
    2. Create an edge route to expose the MinIO service:
      oc create route edge ibm-lh-lakehouse-minio-console --service=ibm-lh-lakehouse-minio-svc --port=9000
    3. Retrieve the route host for the MinIO service:
      oc get routes ibm-lh-lakehouse-minio-console
      You will now see that the route is port forwarded and is accessible from external systems.
    4. Extract the access and secret keys if needed:
      oc extract secret/ibm-lh-config-secret --to=- --keys=env.properties | grep -E "LH_S3_ACCESS_KEY|LH_S3_SECRET_KEY"
    These steps are required only when you use the default internal MinIO bucket for testing or non-production purposes. Production-grade metadata stores already use S3 or COS buckets with external endpoints, and do not require port forwarding.
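
The route host and the two keys from the steps above can also be read individually, which is convenient when you fill in connection details. The following is a sketch under assumptions: get_prop is a hypothetical helper, and it assumes that env.properties contains plain KEY=value lines with the key at the start of the line. The original grep command does not depend on that layout, so use it instead if this helper returns nothing.

```shell
#!/bin/sh
# Hypothetical helper: reads properties text on stdin and prints the value of one key.
# $1 = key name, for example LH_S3_ACCESS_KEY.
# Assumption: lines have the form KEY=value, and values may themselves contain "=".
get_prop() {
  awk -F= -v k="$1" '$1 == k { sub(/^[^=]*=/, ""); print }'
}

# Example usage against a cluster (not executed here):
# host=$(oc get route ibm-lh-lakehouse-minio-console -o jsonpath='{.spec.host}')
# echo "MinIO endpoint host: ${host}"
# oc extract secret/ibm-lh-config-secret --to=- --keys=env.properties | get_prop LH_S3_ACCESS_KEY
```

Avoid echoing the secret key into shared logs when you use a helper like this.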

Iceberg metastore connection test is always successful

Applies to: 5.4.0

When you create a connection to Iceberg metastore and click Test connection, the test always passes. There is no validation for this test, so the result is unreliable.

Workaround: There is currently no workaround for this issue.

Entity store operator fails if the target table name has special characters and Iceberg metastore is used

Applies to: 5.4.0

The Entity store operator fails if the target table name contains special characters and an Iceberg metastore is used.

Workaround: There is currently no workaround for this issue.

Document set operator fails with Schema not found

Applies to: 5.4.0

The Document set operator fails with a Schema not found error when it runs in the Spark orchestrator.

Workaround: Document set operations are supported only for catalogs that are connected to the Spark engine within the Lakehouse. You can't use an external Presto connection to create a document set or to ingest data by using ingest document set. Ensure that both the Spark engine and the catalog are present in the Lakehouse and connected.

Known issues for Data Observability

Email alert recipients do not receive notifications about alerts

Applies to: 5.4.0

Email alert recipients do not receive notifications when alerts are triggered. This issue occurs because the system cannot retrieve the dynamic SMTP configuration from the IBM® Software Hub platform.

Workaround: Configure a different alert receiver type (such as Slack, PagerDuty, or Microsoft Teams) to receive alert notifications until the email alert receiver issue is resolved.

Known issues for watsonx.data integration

Installation parameters incorrectly inherit DataStage Enterprise Plus configuration values

Applies to: 5.4.0

During an installation of the IBM Software Hub platform, the installation parameters for Data Observability (enableDataObservability) and StreamSets (enableRealtimeStreaming) incorrectly inherit their values from the DataStage Enterprise Plus batch/bulk ETL parameter (enableBatchBulkETL). As a result, the Data Observability and StreamSets components are installed or configured based on the DataStage Enterprise Plus installation setting rather than their own explicit configuration values. This issue affects only a fresh installation of IBM Software Hub 5.4.0. Upgrades are not affected.

Workaround: To ensure correct component installation during a fresh installation of IBM Software Hub 5.4.0, explicitly set the enableDataObservability and enableRealtimeStreaming parameters in your installation configuration:

For Data Observability:
  1. If Data Observability is not required, set enableDataObservability to false:
    oc -n ${PROJECT_CPD_INST_OPERANDS} patch watsonxdataintegration watsonxdataintegration-cr --patch '{"spec":{"enableDataObservability":false}}' --type=merge
  2. If you don’t want Data Observability installed, delete the current custom resource:
    oc -n ${PROJECT_CPD_INST_OPERANDS} delete databandinstaller databand

For StreamSets:
  1. If StreamSets is not required, set enableRealtimeStreaming to false:
    oc -n ${PROJECT_CPD_INST_OPERANDS} patch watsonxdataintegration watsonxdataintegration-cr --patch '{"spec":{"enableRealtimeStreaming":false}}' --type=merge
  2. If you don’t want StreamSets installed, delete the current custom resource:
    oc -n ${PROJECT_CPD_INST_OPERANDS} delete sdiinstaller streamsets-sdi
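
Because a merge patch matches keys exactly, stray whitespace inside the quoted parameter name would create a different key and leave the intended parameter unchanged. The following sketch builds the patch from the parameter name and removes any whitespace first. The function name build_disable_patch is hypothetical, and the verification command in the comments assumes that the parameter is stored under spec in the custom resource, as the patch commands imply.

```shell
#!/bin/sh
# Hypothetical helper: prints a merge patch that sets one installation parameter to false.
# $1 = parameter name, for example enableDataObservability or enableRealtimeStreaming.
build_disable_patch() {
  # Remove whitespace so that the JSON key matches the parameter name exactly.
  name=$(printf '%s' "$1" | tr -d '[:space:]')
  printf '{"spec":{"%s":false}}' "$name"
}

# Example usage (not executed here):
# oc -n "${PROJECT_CPD_INST_OPERANDS}" patch watsonxdataintegration watsonxdataintegration-cr \
#   --patch "$(build_disable_patch enableRealtimeStreaming)" --type=merge
# Check the stored value afterwards:
# oc -n "${PROJECT_CPD_INST_OPERANDS}" get watsonxdataintegration watsonxdataintegration-cr \
#   -o jsonpath='{.spec.enableRealtimeStreaming}'
```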