Structure of the data and metadata objects used to store a table in the Apache Iceberg open table format

This topic outlines the structure of the data and metadata objects that Data Gate for watsonx uses to store a table in the Apache Iceberg open table format.

Structure of the data and metadata objects

  • The root folder for Db2® for z/OS® sources is:
    bucket/<Db2z-location-name>/<DG-pairing-name>
    where <Db2z-location-name> is the name of the connected Db2 subsystem or data sharing group in the SYSIBM.LOCATIONS table, and <DG-pairing-name> is the name of the pairing.
  • The root folder for VSAM and IMS sources is:
    bucket/<dg-instance-id>
    where <dg-instance-id> is the ID of the Data Gate for watsonx instance.

The structure of objects in the Apache Iceberg target table is the same for all data sources:

/<schema-1>.db
    /<table-1>
        /metadata
            metadata files
        /data
            data files
    /<table-2>
        /metadata
            metadata files
        /data
            data files
/<schema-2>.db
    /<table-1>
        /metadata
            metadata files
        /data
            data files
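
The layout above can be expressed as a small path-building helper. The following is a minimal sketch only: the function names and the sample values (LOC1, PAIR1, SALES, ORDERS) are hypothetical and are not part of Data Gate for watsonx. It only shows how the root folder and the per-table folders fit together.

```python
from pathlib import PurePosixPath


def db2z_root(bucket, location_name, pairing_name):
    """Root folder for Db2 for z/OS sources: bucket/<Db2z-location-name>/<DG-pairing-name>."""
    return PurePosixPath(bucket) / location_name / pairing_name


def vsam_ims_root(bucket, instance_id):
    """Root folder for VSAM and IMS sources: bucket/<dg-instance-id>."""
    return PurePosixPath(bucket) / instance_id


def table_folders(root, schema, table):
    """Return the metadata and data folders of one table below a root folder."""
    table_dir = root / f"{schema}.db" / table
    return {"metadata": table_dir / "metadata", "data": table_dir / "data"}


# Hypothetical sample values, for illustration only.
folders = table_folders(db2z_root("bucket", "LOC1", "PAIR1"), "SALES", "ORDERS")
print(folders["metadata"])
print(folders["data"])
```

Only the root folder differs between source types; everything below it is built the same way for Db2 for z/OS, VSAM, and IMS sources.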

Data Gate implements version 2 of the Iceberg specification, which supports row-level updates and deletes for analytic tables with immutable files. For support of version 1, see the section Apache Iceberg table format versions. For more details about the Iceberg format, see the official Apache Iceberg documentation.

Identifier field IDs

An Apache Iceberg table schema includes information about the primary keys in the original Db2 for z/OS source tables. This information is stored in an identifier-field-ids metadata object. Computational engines that support the Apache Iceberg data format can use this additional metadata. The data types that can be used as primary keys in Apache Iceberg tables fed by Data Gate for watsonx are listed in the Iceberg specification. See Identifier Field IDs.
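
As an illustration, the key columns can be read back from a table metadata file. The sketch below assumes the JSON layout described in the Iceberg specification (a schemas list whose entries carry schema-id, fields, and an optional identifier-field-ids list). The sample document and its column names are invented, and the helper resolves only top-level fields.

```python
import json

# Invented, heavily shortened table metadata; a real metadata file has many more keys.
SAMPLE_METADATA = """
{
  "format-version": 2,
  "current-schema-id": 0,
  "schemas": [
    {
      "type": "struct",
      "schema-id": 0,
      "identifier-field-ids": [1],
      "fields": [
        {"id": 1, "name": "CUSTOMER_ID", "required": true, "type": "int"},
        {"id": 2, "name": "NAME", "required": false, "type": "string"}
      ]
    }
  ]
}
"""


def identifier_columns(metadata):
    """Return the names of the identifier (primary key) columns of the current schema."""
    current_id = metadata.get("current-schema-id", 0)
    schema = next(s for s in metadata["schemas"] if s.get("schema-id", 0) == current_id)
    names_by_id = {field["id"]: field["name"] for field in schema["fields"]}
    # The list is optional; a table loaded without this support has no such entry.
    return [names_by_id[i] for i in schema.get("identifier-field-ids", [])]


print(identifier_columns(json.loads(SAMPLE_METADATA)))
```

An empty result indicates that the schema carries no identifier fields.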

Note:

If you previously loaded Db2 for z/OS tables using a Data Gate for watsonx version without identifier-field-ids support, this additional metadata field is missing from the Iceberg metadata of the target table. To add it through Data Gate, drop the affected tables and add them to the Data Gate instance again.

Apache Iceberg table format versions

Starting in Data Gate version 5.4.0, Data Gate for watsonx (DGx) supports both version 1 and version 2 of the Apache Iceberg table format. DGx instances can be deployed with either version in the DGx provisioning UI. The table format version is set at provisioning time and should not be changed after the instance is created, because changing it on an existing instance can leave the instance in a bad state. If a different table format version is required, deploy a new DGx instance with the desired version.

The two versions use different write strategies for handling table updates and deletes:

  • Version 1 uses the copy-on-write (CoW) write strategy. When rows in an already-loaded table are updated or deleted, the Parquet files containing those rows are rewritten in full and included in a new snapshot. Each load or reload of a table creates a new snapshot, with all affected Parquet files fully rewritten. Version 1 is often faster than version 2 for read operations, particularly when the table contains a large number of updates and deletes.
  • Version 2 uses the merge-on-read (MoR) write strategy, which introduces delete files (positional and equality delete files) to handle updates and deletes. When a table is loaded or reloaded, delete files are added to mark the affected rows in the new snapshot. Affected Parquet files are not rewritten, which makes version 2 significantly faster for write operations.

Note:

Replication is supported only for Apache Iceberg table format version 2. For DGx instances that use version 1, replication is disabled and all replication-related features in the Data Gate UI are unavailable.
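
The difference between the two write strategies can be illustrated with a toy in-memory model. This is a conceptual sketch only, not how Data Gate or Iceberg is implemented: data files are plain Python lists, the file names are invented, and only positional deletes are modeled.

```python
def cow_delete(data_files, key):
    """Copy-on-write: every file that contains the key is rewritten without that row."""
    new_files = {}
    for name, rows in data_files.items():
        if any(row["id"] == key for row in rows):
            new_files[name + ".rewritten"] = [row for row in rows if row["id"] != key]
        else:
            new_files[name] = rows  # untouched files are reused as they are
    return new_files


def cow_read(data_files):
    """Reading after copy-on-write is a plain scan of the data files."""
    return [row for rows in data_files.values() for row in rows]


def mor_delete(data_files, deletes, key):
    """Merge-on-read: data files stay unchanged; (file, position) pairs are recorded."""
    new_deletes = set(deletes)
    for name, rows in data_files.items():
        for pos, row in enumerate(rows):
            if row["id"] == key:
                new_deletes.add((name, pos))
    return new_deletes


def mor_read(data_files, deletes):
    """Reading after merge-on-read must filter out the rows marked as deleted."""
    return [
        row
        for name, rows in data_files.items()
        for pos, row in enumerate(rows)
        if (name, pos) not in deletes
    ]


files = {"a.parquet": [{"id": 1}, {"id": 2}], "b.parquet": [{"id": 3}]}

cow_files = cow_delete(files, 2)          # a.parquet is replaced by a rewritten copy
mor_deletes = mor_delete(files, set(), 2)  # only a delete marker is added

print(sorted(cow_files), cow_read(cow_files))
print(sorted(mor_deletes), mor_read(files, mor_deletes))
```

Both strategies return the same rows. The copy-on-write path does its extra work when writing (a full file rewrite), while the merge-on-read path defers it to every read (filtering against the delete markers), which matches the read and write trade-off described above.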

For more details about Apache Iceberg table format versioning, see the Apache Iceberg specification.