Replicating IBM watsonx.data data

You can replicate data from other databases to IBM watsonx.data with the Data Replication service.

IBM watsonx.data is an open, hybrid, and governed fit-for-purpose data store optimized to scale all data, analytics, and AI workloads to get greater value from your analytics ecosystem.

Supported versions

IBM watsonx.data 2.1.0 on a Red Hat OpenShift cluster where Data Replication is installed.

Data Replication with watsonx.data supports only the following:

  • IBM watsonx.data connections that use a username and API key
    Important: You cannot use an IBM watsonx.data connection that does not use a username and API key when you set up a replication asset.
  • IBM watsonx.data Iceberg catalogs
  • Amazon S3-compatible storage

Restriction

IBM watsonx.data can only be used as a target data store for Data Replication.

Before you begin

Before you replicate data to IBM watsonx.data, configure an Iceberg catalog and S3-compatible storage that the watsonx.data instance can use. For details about configuring watsonx.data, see the IBM watsonx.data documentation.

Connecting to IBM watsonx.data in a project

To connect to IBM watsonx.data in a project, see IBM watsonx.data Presto connection.

Creating a replication asset with IBM watsonx.data

To create a Data Replication asset:

  1. Click the Assets tab in the project.

  2. Click New asset > Replicate data.

  3. Enter a name.

  4. Click Connections.

  5. On the Source options page, select a source connection from the list of connections or click Add connection to create a new connection.

  6. Click Select data, select a schema, and optionally a table from the schema.

  7. On the Target options page, select watsonx.data from the list of connections, or click Add connection to create a new connection.

    watsonx.data connections require additional parameters as follows:

    1. Select the Iceberg catalog within the watsonx.data target for the replication job to use.

    2. Set additional parameters such as the replication data file prefix and various thresholds that determine when and how Data Replication commits data.

    3. Optionally, set the aggregation buffer size for a specific table. This value controls how much memory, in megabytes, the replication process uses to combine changes to the source data before it saves the data in Apache Parquet file format.

      Use the aggregation buffer to manage changes to large files in the source data store. When the aggregation buffer is configured, delete operations that collide with insert operations are replicated reliably. Insert operations are held in the buffer until the changes are committed to the S3 file system. Any delete operations that collide with the buffered insert operations are removed from the buffer.

      Set the buffer size to 0 to disable the aggregation effect.

  8. On the Review page, review the summary, then click Create.
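
The aggregation buffer described in step 7 can be pictured with a small model. The following Python sketch is illustrative only and is not the Data Replication implementation: the class `AggregationBuffer`, its fields, and the byte accounting based on `repr` are all hypothetical. It also assumes that a delete that collides with a buffered insert cancels that insert, so that neither operation reaches storage, and that a buffer size of 0 passes every operation straight through.

```python
from dataclasses import dataclass, field

BYTES_PER_MB = 1024 * 1024


@dataclass
class AggregationBuffer:
    """Toy model of a per-table aggregation buffer (hypothetical names)."""
    buffer_mb: int                                   # 0 disables aggregation
    pending: dict = field(default_factory=dict)      # key -> row held in memory
    committed: list = field(default_factory=list)    # stand-in for files on S3
    used_bytes: int = 0

    def insert(self, key, row):
        if self.buffer_mb == 0:
            # Aggregation disabled: pass the insert straight through.
            self.committed.append(("insert", key, row))
            return
        if key in self.pending:
            # Replacing a buffered row: release the old row's size first.
            self.used_bytes -= len(repr(self.pending[key]))
        self.pending[key] = row
        self.used_bytes += len(repr(row))   # rough size estimate (assumption)
        if self.used_bytes >= self.buffer_mb * BYTES_PER_MB:
            self.flush()

    def delete(self, key):
        if key in self.pending:
            # Collision: the insert is still buffered, so both cancel out.
            self.used_bytes -= len(repr(self.pending.pop(key)))
            return
        self.committed.append(("delete", key, None))

    def flush(self):
        # Commit every buffered insert, then empty the buffer.
        for key, row in self.pending.items():
            self.committed.append(("insert", key, row))
        self.pending.clear()
        self.used_bytes = 0


# With a 1 MB buffer, the insert and delete for key 1 collide in memory.
buffered = AggregationBuffer(buffer_mb=1)
buffered.insert(1, {"name": "a"})
buffered.insert(2, {"name": "b"})
buffered.delete(1)
buffered.flush()

# With the buffer disabled, both operations are committed individually.
unbuffered = AggregationBuffer(buffer_mb=0)
unbuffered.insert(1, {"name": "a"})
unbuffered.delete(1)
```

In this model, `buffered.committed` holds only the insert for key 2, while `unbuffered.committed` holds both the insert and the delete for key 1. A larger buffer lets more collisions resolve in memory, at the cost of more memory per table.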

By default, Data Replication creates a namespace with the same name as the source schema. If you specify a value for the target schema, Data Replication uses that schema name for the target namespace instead.
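
The namespace rule above amounts to a simple fallback. The following Python sketch restates it; the function `resolve_target_namespace` and its parameter names are hypothetical and are not part of any Data Replication API. Treating an empty string as "not specified" is also an assumption.

```python
from typing import Optional


def resolve_target_namespace(source_schema: str,
                             target_schema: Optional[str] = None) -> str:
    """Return the namespace name that the target would use (illustrative)."""
    if target_schema:
        # An explicit target schema overrides the default.
        return target_schema
    # Default: reuse the source schema name.
    return source_schema


default_name = resolve_target_namespace("SALES")
custom_name = resolve_target_namespace("SALES", "sales_replica")
```

Here `default_name` is `"SALES"` and `custom_name` is `"sales_replica"`.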

Next step

Running replication jobs