Known issues and limitations for watsonx.data integration
Known issues for Unstructured Data Integration
- In flows where a single document class is selected, classification and extraction might not work
Applies to: 2.3.0 (IBM® Software Hub 5.3.0).
In a flow where only one document class is provided for processing, the documents might not be processed properly. In the Unstructured Data Integration flow, the Classification operator might fail to classify the documents, or the Extract operator might fail to extract any entities.
In unstructured data curation, the analysis flow might properly classify the documents. However, when you run the processing flow, the metrics might show that the documents were skipped for extraction or no entities were extracted.
Workaround: Manually update the generated flow:
- Replace the Classification operator with an Extract operator, and select all document classes.
- Remove any additional Extract operators that appear later in the flow.
- Incremental ingestion of unstructured data is not supported for Slack
Applies to: 2.3.0 (IBM Software Hub 5.3.0).
Fixed in: 2.3.1
When Slack is used as a data source in an Unstructured Data Integration flow, incremental ingestion is not supported: all documents are ingested again each time the flow is rerun, even if they were not modified. There is currently no workaround for this issue.
- Language annotator completes with warnings or errors when documents with unknown language are processed
Applies to: 2.3.0 (IBM Software Hub 5.3.0).
Fixed in: 2.3.1
When processing documents with unknown language, the Language annotator node status might report Completed with warnings or Completed with errors, but the logs do not show the reason for such status.
Workaround: Use the Filter if language cannot be detected toggle:
- On: Documents are filtered out. Final status: Completed With Errors.
- Off (default): Documents processed with lang_name="UNKNOWN" and lang_score=0. Final status: Completed With Warnings.
- Milvus node fails with character length exception
Applies to: 2.3.0 (IBM Software Hub 5.3.0).
Fixed in: 2.3.1
Milvus node fails with the following exception:
MilvusException: (code=1100, message=length of varchar field text exceeds max length
Workaround: Use the chunking operator in the flow. Milvus supports fields up to 65,536 characters.
- Related assets not copied when copying flow from a catalog to a project
Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later
When you copy the Unstructured Data Integration flow from a catalog to a project using Add to project, related assets are not copied.
Workaround: Copy related assets manually and update the flow in the project. Alternatively, run:
curl --insecure -X POST "/udp/v1/flows/{flow_id}/deepcopy" -H "Authorization: Bearer <bearer_token>" -H "Content-Type: application/json" -d '{"container_kind": "catalog", "container_id": "<catalog_id>", "target_container_kind": "project", "target_container_id": "<project_id>"}'
- Flows using document libraries can't be promoted to space
Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later
A flow that uses a document library does not work after it is promoted from a project to a space, because the document library is not promoted along with the flow.
Workaround:
- When designing the flow in a project, create a local parameter or a parameter set for the document library ID.
- Assign this parameter in the property panel of the document set operator, instead of directly entering the value of the document library ID.
- Promote the flow to a space when ready.
- Create the document library in the space.
- When running the flow in the space, pass the document library ID as a parameter or a parameter set.
- Flows created by Unstructured Data Curation fail in deployment spaces at the Document Set operator
When you promote a flow created by Unstructured Data Curation to a deployment space, the flow might fail due to missing Presto connection configuration. The Document Set operator fails with the error Missing or Invalid 'asset_id' id because the project settings (including Presto connection configuration) are not automatically promoted to spaces.
Workaround: Before running an Unstructured Data Curation flow in a deployment space, you must manually configure a Presto connection:
- Navigate to the deployment space Manage tab.
- Locate the Document set storage section.
- Add and configure a Presto connection.
- Save the settings.
- Run the promoted flow.
This configuration is required for the Document Set operator to access the necessary storage resources in the deployment space environment.
- Document Set and Entity Store operators using Python Orchestrator fail
Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later
The Document Set and Entity Store operators using Python Orchestrator might fail with the following error:
Node Document set failed and caused aborting the branch execution: Please check if the MinIO bucket associated with the catalog, and the service route has been created or not.
Workaround: Ensure the following two prerequisites are met for these operators:
- Access to the associated metadata store bucket
Each metadata store (for example, Hive or SQL) is associated with an S3 or Cloud Object Storage bucket. The user executing the operators must have access permissions to this underlying bucket. If the access is not already granted, you must add the required user or group to the bucket using the watsonx Infrastructure Manager Console. Without bucket access, the operators are not able to read or write data to the metadata store.
- Handling the default MinIO bucket (Non-production usage)
For exploratory or non-production scenarios, watsonx.data includes a default MinIO bucket that is automatically associated with the metadata store. However, this default bucket uses an internal S3 endpoint that is not accessible from external systems such as Unstructured Data Integration. If you plan to use this default MinIO bucket, you must expose the endpoint externally so that it can be accessed by outside systems.
Note: Creating the edge route exposes the MinIO console externally, allowing external clients to interact with it.
Follow these steps to expose the MinIO bucket:
- Access the MinIO Console.
- Create an edge route to expose the MinIO service:
oc create route edge ibm-lh-lakehouse-minio-console --service=ibm-lh-lakehouse-minio-svc --port=9000
- Retrieve the route host for the MinIO service:
oc get routes ibm-lh-lakehouse-minio-console
You will now see that the route is port forwarded and is accessible from external systems.
- Extract the access and secret keys if needed:
oc extract secret/ibm-lh-config-secret --to=- --keys=env.properties | grep -E "LH_S3_ACCESS_KEY|LH_S3_SECRET_KEY"
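After the route exists, external clients typically need a full endpoint URL rather than the bare route host. The following is a minimal sketch, assuming that the edge route terminates TLS at the OpenShift router so that clients connect over HTTPS on the route host. The function name minio_endpoint and the sample host are hypothetical and are not part of the product.

```shell
# Hypothetical helper: turn a route host into an HTTPS endpoint URL.
# Assumes an edge route, where TLS is terminated at the OpenShift router.
minio_endpoint() {
  printf 'https://%s\n' "$1"
}

# On a cluster, the host could be read from the route created earlier:
#   host=$(oc get route ibm-lh-lakehouse-minio-console -o jsonpath='{.spec.host}')
#   minio_endpoint "$host"
# A placeholder host is used here so that the sketch runs without cluster access.
minio_endpoint "minio.apps.example.com"
```

The resulting URL is what you would enter as the endpoint when you configure the bucket for an external system, together with the access and secret keys from the last step.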
- The flow node output preview table is not available when using Spark orchestrator
Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later
When you run a flow that uses Spark orchestrator, the preview table that shows all the node output is not available.
Workaround: There is currently no workaround for this issue.
- Iceberg metastore connection test is always successful
Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later
When you create a connection to Iceberg metastore and click Test connection, the test always passes. There is no validation for this test, so the result is unreliable.
Workaround: There is currently no workaround for this issue.
- Entity store operator fails if the target table has special characters depending on the source used
Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later
The Entity store operator fails if the target table has special characters and an Iceberg metastore is used.
Workaround: There is currently no workaround for this issue.
- Document set operator fails with Schema not found
Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later
The Document set operator fails with a Schema not found error when it runs in Spark orchestrator.
Workaround: Document set operations are supported only for catalogs that are connected to the Spark engine within the watsonx.data Lakehouse. You can't use an external Presto connection to create a document set or to ingest data by using Ingest document set. Ensure that both the Spark engine and the catalog are present in the watsonx.data Lakehouse and are connected.
- Entity store and document set operators fail when using Spark orchestrator and Presto hosted on Red Hat® OpenShift®
Applies to: 2.3.0 (IBM Software Hub 5.3.0) and later
Workaround: When running Entity store and Document set operators with Spark orchestrator, use watsonx.data Presto.
Known issues for Data Observability
- Data Observability entry missing in the navigation menu
Applies to: 5.3.0
Fixed in: 5.3.1
When Data Observability is installed as part of the watsonx.data integration installation bundle, the Data Observability entry is missing from the Navigation menu under Data.
Workaround: Append /data-obs to the URL of the IBM Software Hub hostname:
https://cpd-<projectname>.apps.<OCP-domain>/data-obs
For example:
https://cpd-examplename.apps.exampledomain/data-obs
- Databand operator subscription fails to update when applying patch 4 on IBM Software Hub 5.3.1
Applies to: 5.3.1 patch 4
When applying patch 4 to the IBM Software Hub base release, the ibm-cpd-databand-operator-subscription fails to update correctly.
Workaround: Manually delete the CSV (ClusterServiceVersion) before the upgrade to patch 4 by using the following command:
oc -n ${PROJECT_CPD_INST_OPERATORS} delete csv ibm-databand-operator.v1.0.0
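If the deletion is scripted as part of an upgrade procedure, it can help to make the step safe to repeat. The following is a minimal sketch, assuming that you want the command to succeed even when the CSV was already removed. The wrapper name delete_databand_csv is hypothetical; --ignore-not-found is a standard option of oc delete.

```shell
# Hypothetical wrapper around the documented command.
# $1 = operators project, for example "${PROJECT_CPD_INST_OPERATORS}".
# --ignore-not-found makes the call succeed if the CSV is already gone.
delete_databand_csv() {
  oc -n "$1" delete csv ibm-databand-operator.v1.0.0 --ignore-not-found
}

# Usage on a cluster, before applying patch 4:
#   delete_databand_csv "${PROJECT_CPD_INST_OPERATORS}"
```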
Known issues for Data Replication
- Unable to restore Data Replication custom resource with incorrect addonId
Applies to: 5.3.0 and later
When Data Replication is installed as part of watsonx.data integration, the Data Replication custom resource does not restore properly with backup and restore. As a result, the Data Replication custom resource is recreated by the watsonx.data integration operator instead.
Workaround: Patch replicationservice-cr with its correct addonId before the backup:
oc patch replicationservice replicationservice-cr --type=merge -p '{"metadata":{"labels":{"icpdsupport/addOnId":"data-replication"}}}'
- Data Replication service installation from private registry fails
Applies to: 5.3.1
Fixed in: 5.3.1 patch 2
The Data Replication service installation in air-gapped environments fails while pulling the container image. The image pull process references public registries instead of the private registry.
Workaround: The Data Replication service installation in air-gapped environments requires an image digest mirror set configuration to successfully pull containers from private registries. For details, see Configuring an image digest mirror set for IBM Software Hub software images.
Known issues for StreamSets
- Real-time streaming does not install common core services
Applies to: 5.3.1 and later
Fixed in: 5.3.1 patch 2
When you install the watsonx.data integration service and enable only the real-time streaming option, the service does not automatically install common core services.
Workaround: Enable at least one additional installation option along with real-time streaming to ensure that the common core services are installed automatically.
Known issues for watsonx.data integration
- No update of the custom resource versions of dependencies when upgrading from patch 1 to patch 2
Applies to: 5.3.1 patch 2
If you are already on patch 1 of watsonx.data integration, applying patch 2 does not update the custom resource versions of dependencies that were installed by watsonx.data integration and are included in patch 2. This issue does not affect you if you apply patch 2 directly from version 5.3.1.
Workaround: Currently, the watsonx.data integration operator runs its entire playbook only during reconciliation, when there is an update to the custom resource or the operator image. To work around this issue, make the following update to the custom resource:
oc -n ${PROJECT_CPD_INST_OPERANDS} patch watsonxdataintegration watsonxdataintegration-cr -p "{\"spec\":{\"update\":\"$(date)\"}}" --type=merge
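The command changes the custom resource by writing a new value into spec.update, which is what prompts the operator to reconcile. The following is a minimal sketch of building that payload separately so that it can be reviewed before it is applied. The function name build_update_patch and the fixed UTC timestamp format are assumptions; the original command uses the plain output of date, and any value that differs from the previous one should have the same effect.

```shell
# Hypothetical helper: print the merge-patch payload for spec.update.
# $1 = optional value to write; defaults to the current UTC time.
# The timestamp format is an assumption chosen to avoid locale-dependent output.
build_update_patch() {
  stamp=${1:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
  printf '{"spec":{"update":"%s"}}\n' "$stamp"
}

# Print the payload for review.
build_update_patch

# Usage on a cluster:
#   oc -n ${PROJECT_CPD_INST_OPERANDS} patch watsonxdataintegration watsonxdataintegration-cr -p "$(build_update_patch)" --type=merge
```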