Known issues and limitations for watsonx.data integration

The following issues and limitations apply to watsonx.data™ integration.

Known issues for Unstructured Data Integration

In flows where a single document class is selected, classification and extraction might not work

Applies to: 2.3.0 (IBM® Software Hub 5.3.0).

In a flow where only one document class is provided for processing, the documents might not be properly processed.

In the Unstructured Data Integration flow, the Classification operator might fail to classify the documents, or the Extract operator might fail to extract any entities.

In unstructured data curation, the analysis flow might properly classify the documents. However, when you run the processing flow, the metrics might show that the documents were skipped for extraction or no entities were extracted.

Workaround: Manually update the generated flow:

  1. Replace the Classification operator with an Extract operator, and select all document classes.
  2. Remove any additional Extract operators that appear later in the flow.

Incremental ingestion of unstructured data is not supported for Slack

Applies to: 2.3.0 (IBM Software Hub 5.3.0).

Fixed in: 2.3.1

When Slack is used as a data source in an Unstructured Data Integration flow, incremental ingestion is not supported: documents are ingested again each time you rerun the flow, even if they were not modified. There is currently no workaround for this issue.

Language annotator completes with warnings or errors when documents with unknown language are processed

Applies to: 2.3.0 (IBM Software Hub 5.3.0).

Fixed in: 2.3.1

When documents with an unknown language are processed, the Language annotator node status might report Completed with warnings or Completed with errors, but the logs do not show the reason for that status.

Workaround: Use the Filter if language cannot be detected toggle:
  • On: Documents are filtered out. Final status: Completed With Errors.
  • Off (default): Documents are processed with lang_name="UNKNOWN" and lang_score=0. Final status: Completed With Warnings.

Milvus node fails with character length exception

Applies to: 2.3.0 (IBM Software Hub 5.3.0).

Fixed in: 2.3.1

Milvus node fails with the following exception:
MilvusException: (code=1100, message=length of varchar field text exceeds max length

Workaround: Use the chunking operator in the flow. Milvus supports fields up to 65,536 characters.
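
To see in advance which documents are likely to need chunking, you can compare their length with the limit. The following is a minimal sketch, not part of the product: the function name find_oversized_files and the MAX_CHARS variable are hypothetical, and it assumes that the source documents are available as local plain-text (.txt) files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lists plain-text files whose character count exceeds a limit.
# MAX_CHARS defaults to the field limit quoted above; adjust it to match the
# maximum length that is defined for the text field in your collection.
MAX_CHARS="${MAX_CHARS:-65536}"

find_oversized_files() {
  local dir="$1" file count
  for file in "$dir"/*.txt; do
    [ -e "$file" ] || continue             # skip when the glob matches nothing
    count=$(wc -m < "$file" | tr -d ' ')   # character count (depends on the locale)
    if [ "$count" -gt "$MAX_CHARS" ]; then
      printf '%s %s\n' "$count" "$file"    # print "<count> <path>" for each oversized file
    fi
  done
}
```

For example, find_oversized_files ./documents prints one line for each file that is longer than the limit. Files that are listed are candidates for the chunking operator.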

Related assets not copied when copying flow from a catalog to a project

Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later

When you copy an Unstructured Data Integration flow from a catalog to a project by using Add to project, related assets are not copied.

Workaround: Copy related assets manually and update the flow in the project. Alternatively, run:
curl --insecure -X POST "/udp/v1/flows/{flow_id}/deepcopy" \
  -H "Authorization: Bearer <bearer_token>" \
  -H "Content-Type: application/json" \
  -d '{"container_kind": "catalog", "container_id": "<catalog_id>", "target_container_kind": "project", "target_container_id": "<project_id>"}'

Flows using document libraries can't be promoted to space

Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later

A flow that uses a document library does not work after it is promoted from a project to a space, because the document library is not promoted along with the flow.

Workaround:
  1. When you design the flow in a project, create a local parameter or a parameter set for the document library ID.
  2. Assign this parameter in the property panel of the document set operator instead of entering the document library ID directly.
  3. Promote the flow to a space when it is ready.
  4. Create the document library in the space.
  5. When you run the flow in the space, pass the document library ID as a parameter or a parameter set.

Flows created by Unstructured Data Curation fail in deployment spaces at the Document Set operator

When you promote a flow created by Unstructured Data Curation to a deployment space, the flow might fail because the Presto connection configuration is missing. The Document Set operator fails with the error Missing or Invalid 'asset_id' id because the project settings, including the Presto connection configuration, are not automatically promoted to spaces.

Workaround: Before running an Unstructured Data Curation flow in a deployment space, you must manually configure a Presto connection:
  1. Navigate to the deployment space Manage tab.
  2. Locate the Document set storage section.
  3. Add and configure a Presto connection.
  4. Save the settings.
  5. Run the promoted flow.

This configuration is required for the Document Set operator to access the necessary storage resources in the deployment space environment.

Document Set and Entity Store operators using Python Orchestrator fail

Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later

The Document Set and Entity Store operators using Python Orchestrator might fail with the following error:
Node Document set failed and caused aborting the branch execution: Please check if the MinIO bucket associated with the catalog, and the service route has been created or not.

Workaround: Ensure that the following two prerequisites are met for these operators:
  • Access to the associated metadata store bucket

    Each metadata store (for example, Hive or SQL) is associated with an S3 or Cloud Object Storage bucket. The user who runs the operators must have access permissions to this underlying bucket. If access is not already granted, add the required user or group to the bucket by using the watsonx Infrastructure Manager Console. Without bucket access, the operators cannot read data from or write data to the metadata store.

  • Handling the default MinIO bucket (Non-production usage)
    For exploratory or non-production scenarios, watsonx.data includes a default MinIO bucket that is automatically associated with the metadata store. However, this default bucket uses an internal S3 endpoint that is not accessible from external systems such as Unstructured Data Integration. If you plan to use this default MinIO bucket, you must expose the endpoint externally so that it can be accessed by outside systems.
    Note: Creating the edge route exposes the MinIO console externally, allowing external clients to interact with it.
    Follow these steps to expose the MinIO bucket:
    1. Access the MinIO Console.
    2. Create an edge route to expose the MinIO service:
      oc create route edge ibm-lh-lakehouse-minio-console --service=ibm-lh-lakehouse-minio-svc --port=9000
    3. Retrieve the route host for the MinIO service:
      oc get routes ibm-lh-lakehouse-minio-console
      The output shows the route host, which is now accessible from external systems.
    4. Extract the access and secret keys if needed:
      oc extract secret/ibm-lh-config-secret --to=- --keys=env.properties | grep -E "LH_S3_ACCESS_KEY|LH_S3_SECRET_KEY"
    This step is only required when using the default internal MinIO bucket for testing or non-production purposes. Production-grade metadata stores already use S3 or COS buckets with external endpoints, and do not require port forwarding.
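
The route steps above can be wrapped in a small script so that they are safe to rerun. This is a sketch under stated assumptions rather than a supported procedure: the function names ensure_minio_route and minio_endpoint_url are hypothetical, the route and service names are the ones used in the steps, and the script assumes that you are logged in with oc and that the current project is the one where watsonx.data is installed.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the route commands from the steps above in two functions.
ROUTE_NAME="ibm-lh-lakehouse-minio-console"
SERVICE_NAME="ibm-lh-lakehouse-minio-svc"

# Create the edge route only if it does not exist yet, so that reruns are harmless.
ensure_minio_route() {
  if ! oc get route "$ROUTE_NAME" >/dev/null 2>&1; then
    oc create route edge "$ROUTE_NAME" --service="$SERVICE_NAME" --port=9000
  fi
}

# Print the external HTTPS endpoint of the route.
# An edge route terminates TLS at the router, so the scheme is https.
minio_endpoint_url() {
  local host
  host=$(oc get route "$ROUTE_NAME" -o jsonpath='{.spec.host}') || return 1
  [ -n "$host" ] || return 1
  printf 'https://%s\n' "$host"
}

# Example usage (requires a cluster login):
#   ensure_minio_route && minio_endpoint_url
```

Because the route exposes the service externally, the same caution as in the preceding note applies. Use this approach only for testing or non-production purposes.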

The flow node output preview table is not available when using Spark orchestrator

Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later

When you run a flow that uses Spark orchestrator, the preview table that shows all the node output is not available.

Workaround: There is currently no workaround for this issue.

Iceberg metastore connection test is always successful

Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later

When you create a connection to Iceberg metastore and click Test connection, the test always passes. There is no validation for this test, so the result is unreliable.

Workaround: There is currently no workaround for this issue.

Entity store operator fails if the target table has special characters depending on the source used

Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later

The Entity store operator fails if the name of the target table contains special characters and an Iceberg metastore is used.

Workaround: There is currently no workaround for this issue.

Document set operator fails with Schema not found

Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later

The Document set operator fails with a Schema not found error when you run it with the Spark orchestrator.

Workaround: Document set operations are supported only for catalogs that are connected to the Spark engine within the watsonx.data Lakehouse. You can't use an external Presto connection to create a document set or to ingest data by using Ingest document set. Ensure that both the Spark engine and the catalog are present in the watsonx.data Lakehouse and are connected.

Entity store and document set operators fail when using Spark orchestrator and Presto hosted on Red Hat® OpenShift®

Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later

Workaround: When running Entity store and Document set operators with Spark orchestrator, use watsonx.data Presto.

Error when trying to map features in Milvus node

When you configure the Generate output > Milvus node, the following error is displayed when you try to map features to columns, even if the connection test passes:
Branched flow task execution failed for set_node_features in non operator execution flow with error:Failed connecting to Milvus DB.

Applies to: 2.3.1 (IBM Software Hub 5.3.1) and later

Fixed in: 2.3.1 patch 5

Workaround: To solve the connection issue, apply the steps described in Unable to access Milvus from a non-VM macOS system.

Known issues for Data Observability

Data Observability entry missing in the navigation menu

Applies to: 5.3.0

Fixed in: 5.3.1

When Data Observability is installed as part of the watsonx.data integration installation bundle, the Data Observability entry is missing from the Navigation menu under Data.

Workaround: Append /data-obs to the IBM Software Hub URL:

https://cpd-<projectname>.apps.<OCP-domain>/data-obs
For example:
https://cpd-examplename.apps.exampledomain/data-obs

Databand cannot upgrade with resource quota enabled

Applies to: 5.3.0 and later

Databand cannot be installed or upgraded to 5.3.1 from IBM Software Hub when a resource quota is configured.

Workaround:
  1. Verify that a resource quota is defined, and get its current requests and limits, by running the following command. If a resource quota is present, continue to step 2. If no quota is found, the problem has a different cause.
    oc get resourcequota \
    --namespace=${PROJECT_CPD_INST_OPERANDS}

    The command returns output in the following format:

    zen    cpd-quota    4s    requests.cpu: 76280m/200, requests.memory: 287349460172800m/1200Gi   limits.cpu: 207875m/800, limits.memory: 606134Mi/1800Gi
  2. Create a limit range for resources in the operands project.

    Use the following command as an example. Adjust the requests and limits based on the values set in the resource quota.

    cat << EOF | oc apply -f -
    apiVersion: v1
    kind: LimitRange
    metadata:
      name: cpu-resource-limits
      namespace: ${PROJECT_CPD_INST_OPERANDS}
    spec:
      limits:
      - default:
          cpu: 300m
          memory: 200Mi
        defaultRequest:
          cpu: 200m
          memory: 200Mi
        type: Container
    EOF

    The values in the preceding example are based on the following values:

    Type              Resource quota       Limit range
    CPU request       76280m               200m
    CPU limit         207875m              300m
    Memory request    287349460172800m     200Mi
    Memory limit      606134Mi             200Mi

The installation or upgrade completes after you create the limit range.
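
If you need to adjust the values for your own quota, it can help to generate the manifest from parameters instead of editing it by hand. The following sketch assumes a hypothetical helper named render_limit_range; the argument order is an arbitrary choice for illustration, and the manifest layout is the same as in step 2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a LimitRange manifest with the same layout as step 2.
# Arguments: namespace, CPU request, CPU limit, memory request, memory limit.
render_limit_range() {
  local namespace="$1" cpu_request="$2" cpu_limit="$3" mem_request="$4" mem_limit="$5"
  cat <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-resource-limits
  namespace: ${namespace}
spec:
  limits:
  - default:
      cpu: ${cpu_limit}
      memory: ${mem_limit}
    defaultRequest:
      cpu: ${cpu_request}
      memory: ${mem_request}
    type: Container
EOF
}

# Example usage with the values from the table above (requires a cluster login):
#   render_limit_range "${PROJECT_CPD_INST_OPERANDS}" 200m 300m 200Mi 200Mi | oc apply -f -
```

Printing the manifest before you pipe it to oc apply makes it easy to review the defaults that are applied to containers in the operands project.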

Databand operator subscription fails to update when upgrading from IBM Software Hub 5.3.0 to 5.3.1 or upgrading IBM Software Hub 5.3.1 to patch 4 and later

Applies to: 5.3.1 patch 4 and later

When upgrading IBM Software Hub from 5.3.0 to 5.3.1 or upgrading IBM Software Hub 5.3.1 to patch 4 or later, the ibm-cpd-databand-operator-subscription fails to update correctly.

Workaround: Manually delete the ClusterServiceVersion (CSV) before the upgrade by using the following command:

oc -n ${PROJECT_CPD_INST_OPERATORS} delete csv ibm-databand-operator.v1.0.0

Known issues for Data Replication

Unable to restore Data Replication custom resource with incorrect addonId

Applies to: 5.3.0 and later

Fixed in: 5.3.1 patch 5

When Data Replication is installed as part of watsonx.data integration, the Data Replication custom resource is not restored properly by backup and restore. As a result, the Data Replication custom resource is recreated by the watsonx.data integration operator instead.

Workaround: Patch replicationservice-cr with its correct addonId before the backup:

oc patch replicationservice replicationservice-cr --type=merge -p '{"metadata":{"labels":{"icpdsupport/addOnId":"data-replication"}}}'

Data Replication service installation from private registry fails

Applies to: 5.3.1

Fixed in: 5.3.1 patch 2

The Data Replication service installation in air-gapped environments fails while pulling the container image. The image pull process references public registries instead of the private registry.

Workaround: The Data Replication service installation in air-gapped environments requires an image digest mirror set configuration to successfully pull containers from private registries. For details, see Configuring an image digest mirror set for IBM Software Hub software images.

Known issues for StreamSets

StreamSets installation fails when applying patch 5

Applies to: 5.3.1 patch 5

If you apply patch 5 of the watsonx.data integration service when the original installation was performed with an explicit image pull prefix of icr.io, then the StreamSets installation remains in the In Progress state indefinitely.

Workaround: Run the following command so that the patch correctly pulls images from cp.icr.io:

oc patch -n ${PROJECT_CPD_INST_OPERANDS} sdiinstaller.streamsets-sdi.cpd.ibm.com/streamsets-sdi --type=merge -p $'spec:\n operandImagePullPrefix: "cp.icr.io"'

Real-time streaming does not install common core services

Applies to: 5.3.1 and later

Fixed in: 5.3.1 patch 2

When you install the watsonx.data integration service and enable only the real-time streaming option, the service does not automatically install common core services.

Workaround: Enable at least one additional installation option along with real-time streaming to ensure that the common core services are installed automatically.

Known issues for watsonx.data integration

No update of the custom resource versions of dependencies when upgrading from patch 1 to patch 2

Applies to: 5.3.1 patch 2

Fixed in: 5.3.1 patch 5

If you are already on patch 1 of watsonx.data integration, applying patch 2 does not update the custom resource versions of dependencies that were installed by watsonx.data integration and are included in patch 2. This issue does not affect you if you apply patch 2 directly from version 5.3.1.

Workaround: Currently, the watsonx.data integration operator runs its entire playbook during reconciliation only when there is an update to the custom resource or the operator image. To work around this issue, make the following update to the custom resource:

oc -n ${PROJECT_CPD_INST_OPERANDS} patch watsonxdataintegration watsonxdataintegration-cr -p "{\"spec\":{\"update\":\"$(date)\"}}" --type=merge