What's new and changed in watsonx.data integration

watsonx.data integration updates can include new features and fixes. Releases are listed in reverse chronological order so that the latest release is at the beginning of the topic.

This release includes the following changes:

New features

This release of watsonx.data integration includes the following features:

Connect to AlloyDB for PostgreSQL databases: You can now use the AlloyDB for PostgreSQL connector in your DataStage flows to read data from and write data to AlloyDB for PostgreSQL databases.
Access data in AWS Databricks: You can now use the AWS Databricks connector in your DataStage flows to access and process data in Databricks workspaces.
Access files in Microsoft SharePoint: You can now use the Microsoft SharePoint Files on Canvas connector in your DataStage flows to read and write files stored in SharePoint document libraries.
Access data in Microsoft Dynamics 365: You can now use the Microsoft Dynamics 365 connector in your DataStage flows to read and write business data in Dynamics 365 applications.
Export and import compiled pipeline binaries: You can now export and import compiled Python binaries with optimized runner pipelines, which means that you can move pipelines together with their compiled assets. You control this behavior by using the include-python-binaries and include-common-binaries options in cpdctl.
Data encryption for Teradata connections: You can now enable full session data encryption for Teradata optimized flows by using the new Data Encryption option. This option uses either TDGSS or TLS/SSL to encrypt network traffic, SQL statements, data requests, and responses for the entire session.
Create parameter sets from connection properties: You can now create parameter sets directly from connection properties for supported connectors. Select one or more connection types and add their properties as parameters so that you can easily reuse and manage configuration values across pipelines.
Run remote engines on s390x systems: You can now run remote engines on s390x (IBM Z and LinuxONE) systems, deployed as Docker containers or in Kubernetes clusters. You can submit jobs from x86_64 environments and run them on s390x hardware, which lets you distribute workloads across heterogeneous architectures.

Receive alerts in Microsoft Teams or PagerDuty: You can now create alert receivers to connect Data Observability to your Microsoft Teams channels or PagerDuty services. When you create a PagerDuty alert receiver, you can track triggered alerts and manage events with your existing PagerDuty services. When you create a Microsoft Teams alert receiver, you can receive detailed notifications about triggered alerts in your Microsoft Teams channels.

Identify trends in your data by using metric charts: You can now add metric charts to your Data Observability dashboard. Metric charts show how a metric has changed across job runs, which can help you identify trends in your data.
Reuse connection details in StreamSets flows: When you deploy a Data Collector engine version 7.4.0, you can include connections in StreamSets flows.

Easily manage and reuse StreamSets flows by using parameters: You can now use parameters in StreamSets flows to set values for stage properties at run time. You can change parameter values for each job run without editing the flow, making your flows easier to manage and reuse.
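
As a rough illustration of how run-time parameters keep a flow unchanged between runs, the following Python sketch replaces placeholders in stage properties with values that are supplied for each job run. It is not the StreamSets implementation: the `${NAME}` placeholder syntax, the property names, and the `resolve_parameters` helper are all assumptions made for this example.

```python
import re

# Hypothetical placeholder syntax for this sketch: ${NAME}
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def resolve_parameters(stage_properties, defaults, overrides=None):
    """Return a copy of stage_properties with placeholders replaced.

    Values in overrides (supplied for one job run) take precedence
    over the defaults that are defined with the flow.
    """
    values = {**defaults, **(overrides or {})}

    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"No value supplied for parameter {name!r}")
        return str(values[name])

    return {
        key: _PLACEHOLDER.sub(substitute, value) if isinstance(value, str) else value
        for key, value in stage_properties.items()
    }


# The flow definition stays the same for every run.
flow_properties = {"directory": "/data/${REGION}/incoming", "batch_size": "${BATCH_SIZE}"}
flow_defaults = {"REGION": "eu", "BATCH_SIZE": 1000}

default_run = resolve_parameters(flow_properties, flow_defaults)
override_run = resolve_parameters(flow_properties, flow_defaults, {"REGION": "us"})
```

In this sketch, only the `overrides` argument changes between runs, which mirrors the idea of changing parameter values without editing the flow.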

Choose how your browser connects to StreamSets engines: StreamSets engines can now use the tunneling communication method, giving you more flexibility in how your browser connects to the engine. With tunneling, the browser communicates with watsonx.data integration, which securely relays data to the engine through an encrypted tunnel. This method requires no additional setup and is enabled by default.

Run multiple engines for a StreamSets environment to support job failover: When you run multiple engines for a StreamSets environment, jobs can now fail over to another engine if the current engine becomes unavailable. The job restarts on an available engine and continues processing from where it stopped.
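
The failover behavior can be pictured with a small simulation. The following Python sketch is a simplified model under stated assumptions, not how StreamSets engines are implemented: the `run_with_failover` function, the `EngineUnavailable` exception, and the idea of a single committed offset are hypothetical stand-ins for whatever state the product actually tracks.

```python
class EngineUnavailable(Exception):
    """Raised when a (simulated) engine stops responding."""


def run_with_failover(records, engines):
    """Process each record once, moving to the next engine on failure.

    engines is a list of (name, fail_at) pairs. A simulated engine raises
    EngineUnavailable when asked to process the record at index fail_at;
    None means that the engine never fails.
    """
    offset = 0      # last committed position; survives engine failures
    processed = []  # (engine name, record) pairs in processing order
    for name, fail_at in engines:
        try:
            while offset < len(records):
                if fail_at is not None and offset == fail_at:
                    raise EngineUnavailable(name)
                processed.append((name, records[offset]))
                offset += 1  # commit only after the record is processed
            return processed
        except EngineUnavailable:
            continue  # fail over: the next engine resumes from the saved offset
    raise RuntimeError("No engine available to finish the job")


result = run_with_failover(["a", "b", "c", "d"], [("engine-1", 2), ("engine-2", None)])
```

Because the offset is kept outside any single engine, the second engine in this model continues from the third record instead of starting over.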

Track StreamSets job run history: You can now view a detailed history of a StreamSets job run to diagnose issues and understand the run state, including cases where a run remains in the Queued or Canceling status. The run history lists timestamped events that show status changes, retries, failovers, and other run activities.

Capture a snapshot of data as it moves through a StreamSets job run

You can now capture and view a snapshot to verify how a StreamSets job processes data. A snapshot is a set of data that is captured as it moves through a running job.

Similar to previewing a flow, you can view how snapshot data moves through a job stage by stage. You can drill down to review the values of each record to determine whether the stage transforms data as expected.

Process unstructured documents in multiple languages

You can now ingest and curate unstructured data documents in the following languages:

French
German
Italian
Japanese
Korean
Polish
Spanish

Use semantic chunking in Unstructured Data Integration

You can now select semantic chunking in the Chunking operator. This option produces chunks that follow natural topic and meaning boundaries rather than arbitrary size limits, resulting in more coherent context units, higher‑quality embeddings, more accurate retrieval, and reduced noise during downstream question‑answering.

For details, see Transform data nodes.
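
To make the idea of semantic chunking concrete, the following Python sketch groups consecutive sentences and starts a new chunk when a sentence is dissimilar to the one before it. It is only an illustration of the general technique and does not describe how the Chunking operator works: the bag-of-words vectors stand in for a real embedding model, and the `semantic_chunks` function and its `threshold` value are assumptions.

```python
import math
import re
from collections import Counter


def _vector(sentence):
    """Bag-of-words stand-in for an embedding model (assumption of this sketch)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def _cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Group consecutive sentences into chunks.

    A new chunk starts when the similarity between a sentence and the
    previous sentence falls below threshold.
    """
    chunks = []
    for sentence in sentences:
        if chunks and _cosine(_vector(chunks[-1][-1]), _vector(sentence)) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "The connector reads rows from the database.",
    "The connector writes rows to the database.",
    "Invoices are due within thirty days.",
    "Late invoices incur a fee after thirty days.",
]
chunks = semantic_chunks(sentences)
```

With these sample sentences, the boundary falls where the topic changes from connectors to invoices rather than at a fixed character count.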

Summarize chunks with AI in Unstructured Data Integration

You can now generate AI-powered summaries for each document chunk to improve context understanding and retrieval accuracy.

For details, see Transform data nodes.

Ingest and store unstructured data by using more supported connectors

You can now ingest data from the following sources:

Confluence
Google Drive

You can also use the following target databases for vector store:

OpenSearch
DataStax Astra DB

You can use the following databases for storing document sets and for entity store:

Microsoft Azure Databricks
PostgreSQL
Db2
Oracle

Unstructured data curation supports a subset of these connectors. For details, see Supported connectors for unstructured data curation.

Work with more file types in Unstructured Data Integration

You can now process the following file types:

HTML
XLSX
BMP
GIF
JFIF
JPG
JPEG
PNG
TIFF
TIF

Unstructured data curation supports a subset of these file types. For details, see Supported connectors for unstructured data curation.

New features from 5.3.1 patches

This release of watsonx.data integration includes the following features that were introduced in IBM® Software Hub Version 5.3.1 patches:

Process Kafka 4.x data with StreamSets flows

When you deploy a StreamSets Data Collector engine version 7.2.0, you can use Kafka stages to process data in Kafka 4.x, in addition to Kafka 3.x.

For more information about Data Collector 7.2.0, see 7.2.x engine versions in the IBM watsonx.data intelligence and IBM watsonx.data integration documentation.

Updates

The following updates were introduced in this release:

Before you install or upgrade watsonx.data integration, a cluster administrator must now create cluster-scoped resources, such as custom resource definitions, cluster roles, and cluster role bindings.
The StreamSets component in watsonx.data integration now automatically creates several defensive network policies.

Canvas updates

Data Preview in the DataStage Canvas is now disabled when Data Intelligence is provisioned and you do not have the Preview Data permission.
You can now enable or disable the flight service connector library, which is a feature in Beta, directly in ETL compile options.

PX engine and PX runtime updates

DataService jobs now properly update the status when running during pod restarts, preventing jobs from remaining stuck in "Running" or "Starting" states.
Environment variables with boolean values set to False are now properly retained instead of being dropped, ensuring correct behavior for volume connector configuration parameters.
Dynamic configuration file generation is now enabled when the APT_CONFIG_FILE environment variable is empty.
The data browse API now reads and writes time and timestamp data types with picosecond precision.
You can now run SAP RFC Server connections without thread termination, as errors no longer cause the RFC server threads to stop.
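
The update in the list above about Boolean environment variables relates to a common class of bug: filtering settings by truthiness also discards legitimate False values. The following Python sketch shows that general pattern and one way to avoid it. It is not DataStage code and makes no claim about the actual cause of the original issue; the function names and variable names are hypothetical.

```python
def drop_falsy(settings):
    """Buggy pattern: filtering on truthiness also removes False, 0, and ""."""
    return {name: value for name, value in settings.items() if value}


def drop_unset(settings):
    """Safer pattern: remove only values that are genuinely unset (None)."""
    return {name: value for name, value in settings.items() if value is not None}


# Hypothetical variable names, chosen only for this illustration.
settings = {"USE_VOLUME_CACHE": False, "VOLUME_PATH": "/mnt/data", "UNUSED": None}

buggy = drop_falsy(settings)   # silently loses USE_VOLUME_CACHE
fixed = drop_unset(settings)   # keeps the explicit False value
```

The distinction matters because an explicit False is a configuration choice, whereas a missing value usually means that a default applies.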

ds-runtime updates

You can now run DataStage workloads in multi‑tenant environments.
You can now use beta Flight support in the DataStage runtime.
You can now run nested loop jobs with improved caching behavior.

API updates

You can now use the SAP BW Load Pack in DataStage.
You can now update job details such as schedules and retention policies by setting the update_job_details flag to true.
To protect your work during backups, DataStage now prevents creating or updating flows while a project backup is in progress.
You can now force replacement of parameter sets, even when parameter types differ, by using replace_mode: force.
You can now use key pair authentication with ELT Pushdown.
Advanced search results are now updated through batch indexing instead of real‑time ingestion, which reduces the delay before changes are reflected.

Connector updates

You can now configure PostgreSQL connections with alternative servers.
You can now enable SSL host validation for Db2 connections.
You can now configure the query_data_size connection property for the Db2 connector.
You can now authenticate to Databricks by using Entra ID with client ID, client secret, and tenant ID.
You can now disable prepared statements for Presto connections.
You can now disable chunked encoding in WXD targets for FIPS environments by using the disable_chunked_encoding property.
You can now choose whether queries use column names or column labels when reading data from Generic JDBC sources. Use this option to control how data is retrieved, especially for sources such as SAS where column labels are defined in table metadata.

Customer-reported issues fixed in this release

For a list of customer-reported issues that were fixed in this release, see the Fix List for IBM Cloud Pak for Data on the IBM Support website.

Deprecated features

The following features were deprecated in this release:

StreamSets environments for Data Collector engine versions 6.4.x - 7.0.x

StreamSets Data Collector engine versions 6.4.x - 7.0.x are now deprecated. They will be removed from service in an upcoming release.

You can no longer configure new environments for the affected engine versions. However, you can continue using existing engines until they are removed.

For uninterrupted service and for the latest updates and features, upgrade to the latest available engine version.

For more information, see Upgrading Data Collector engines in the IBM watsonx.data intelligence and IBM watsonx.data integration documentation.