Structure of the data and metadata objects used to store a table in Apache Iceberg open table format
This topic outlines the structure of the data and metadata objects that Data Gate for watsonx uses to store a table in the Apache Iceberg open table format.
Structure of the data and metadata objects
- The root folder for Db2® for z/OS® sources is:
  bucket/<Db2z-location-name>/<DG-pairing-name>
  where <Db2z-location-name> is the name of the connected Db2 subsystem or data sharing group in the SYSIBM.LOCATIONS table, and <DG-pairing-name> is the name of the pairing.
- The root folder for VSAM and IMS sources is:
  bucket/<dg-instance-id>
  where <dg-instance-id> is a Data Gate for watsonx instance ID.
The structure of objects in the Apache Iceberg target table is the same for all data sources:
/<schema-1>.db
    /<table-1>
        /metadata
            metadata files
        /data
            data files
    /<table-2>
        /metadata
            metadata files
        /data
            data files
/<schema-2>.db
    /<table-1>
        /metadata
            metadata files
        /data
            data files
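The layout above can be sketched as a small path-building helper. This is only an illustration: the function names, the argument names, and the sample values are hypothetical, and the sketch assumes that `bucket` in the root folder patterns stands for the name of the object storage bucket.

```python
from typing import Optional


def root_prefix(bucket: str, *, location: Optional[str] = None,
                pairing: Optional[str] = None,
                instance_id: Optional[str] = None) -> str:
    """Build the root folder for a source (hypothetical helper)."""
    if location and pairing:
        # Db2 for z/OS source: bucket/<Db2z-location-name>/<DG-pairing-name>
        return f"{bucket}/{location}/{pairing}"
    if instance_id:
        # VSAM or IMS source: bucket/<dg-instance-id>
        return f"{bucket}/{instance_id}"
    raise ValueError("need either location and pairing, or instance_id")


def table_prefixes(root: str, schema: str, table: str) -> dict:
    """Return the metadata and data folders of one table under a root folder."""
    base = f"{root}/{schema}.db/{table}"
    return {"metadata": f"{base}/metadata", "data": f"{base}/data"}


# Example with made-up names for the bucket, location, pairing, schema, and table.
db2_root = root_prefix("mybucket", location="DB2LOC", pairing="PAIR1")
print(table_prefixes(db2_root, "SALES", "ORDERS"))
```

Because the structure below the root folder is the same for all data sources, only `root_prefix` needs to distinguish between Db2 for z/OS sources and VSAM or IMS sources.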
Data Gate implements version 2 of the Iceberg specification, which supports row-level updates and deletes for analytic tables with immutable files. For more details about the Iceberg format, see the official Apache Iceberg documentation.
Identifier field IDs
An Apache Iceberg table schema includes information about the primary keys in the original Db2 for z/OS source tables. This information is stored in an identifier-field-ids metadata object. Computational engines that support the Apache Iceberg data format can use this additional metadata. The data types that can be used as primary keys in Apache Iceberg tables fed by Data Gate for watsonx are listed in the Iceberg specification. See Identifier Field IDs.
If you previously loaded Db2 for z/OS tables using a Data Gate for watsonx version without identifier-field-ids support, this additional metadata field is missing in the Iceberg metadata of the target table. To add it through Data Gate, drop the affected tables and add them to the Data Gate instance again.
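To check whether a table already carries this metadata, you can inspect the table metadata JSON file. The following sketch is illustrative only: it relies on the `schemas`, `current-schema-id`, `schema-id`, `fields`, and `identifier-field-ids` keys defined in the Iceberg specification, assumes that the `schemas` list is present in the file, and looks at top-level columns only. The function name and the sample metadata are hypothetical.

```python
import json


def identifier_field_names(metadata: dict) -> list:
    """Return the column names listed in identifier-field-ids of the current schema.

    An empty list means that the field is missing or empty, for example for a
    table that was loaded before identifier-field-ids support was available.
    """
    current_id = metadata.get("current-schema-id", 0)
    for schema in metadata.get("schemas", []):
        if schema.get("schema-id") == current_id:
            ids = set(schema.get("identifier-field-ids", []))
            # Only top-level fields are checked in this sketch.
            return [f["name"] for f in schema["fields"] if f["id"] in ids]
    return []


# Made-up metadata fragment; a real metadata file contains many more keys.
sample = json.loads("""
{
  "format-version": 2,
  "current-schema-id": 0,
  "schemas": [
    {
      "schema-id": 0,
      "type": "struct",
      "identifier-field-ids": [1],
      "fields": [
        {"id": 1, "name": "ORDER_ID", "required": true, "type": "long"},
        {"id": 2, "name": "CUSTOMER", "required": false, "type": "string"}
      ]
    }
  ]
}
""")
print(identifier_field_names(sample))
```

In practice, you would load the most recent metadata file from the `/metadata` folder of the table instead of the inline sample.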
Apache Iceberg table format versions
Starting in Data Gate version 5.4.0, Data Gate for watsonx supports both version 1 and version 2 of the Apache Iceberg table format. DGx instances can be deployed with either version in the DGx provisioning UI. The table format version is set at provisioning time and should not be changed after the instance is created. If a different table format version is required, deploy a new DGx instance with the desired version. Changing the Apache Iceberg table format version on an existing instance can leave it in a bad state.
The two versions use different write strategies for handling table updates and deletes:
- Version 1 uses the copy-on-write (CoW) write strategy. When rows in an already-loaded table are updated or deleted, the Parquet files that contain those rows are rewritten entirely and included in a new snapshot. Each load or reload of a table creates a new snapshot, with all affected Parquet files fully rewritten. Version 1 is often faster than version 2 for read operations, particularly when the table contains a large number of updates and deletes.
- Version 2 uses the merge-on-read (MoR) write strategy, which introduces delete files (positional and equality delete files) to handle updates and deletes. Upon a table load or reload, delete files are added to mark the affected rows in the new snapshot. The affected Parquet files are not rewritten, which makes version 2 significantly faster for write operations.
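The difference between the two write strategies can be illustrated with a toy in-memory model. This is a conceptual sketch, not how Data Gate or any Iceberg library is implemented: data files are plain Python lists, all function names are hypothetical, and only positional deletes are modeled (equality deletes are omitted).

```python
def cow_delete(files: dict, predicate) -> tuple:
    """Copy-on-write: rewrite every data file that contains a matching row."""
    new_files, rewritten = {}, 0
    for name, rows in files.items():
        if any(predicate(r) for r in rows):
            # The whole file is rewritten without the deleted rows.
            new_files[name + ".rewritten"] = [r for r in rows if not predicate(r)]
            rewritten += 1
        else:
            new_files[name] = rows
    return new_files, rewritten


def mor_delete(files: dict, predicate) -> list:
    """Merge-on-read: keep data files as they are, record (file, position) deletes."""
    return [(name, pos)
            for name, rows in files.items()
            for pos, r in enumerate(rows) if predicate(r)]


def mor_read(files: dict, position_deletes: list) -> list:
    """Readers apply the delete entries while scanning the unchanged data files."""
    deleted = set(position_deletes)
    return [r
            for name, rows in files.items()
            for pos, r in enumerate(rows) if (name, pos) not in deleted]


files = {"a.parquet": [1, 2, 3], "b.parquet": [4, 5]}
cow_files, rewritten = cow_delete(files, lambda r: r == 2)
deletes = mor_delete(files, lambda r: r == 2)
print(rewritten, deletes, mor_read(files, deletes))
```

The write cost in the CoW case grows with the size of the affected files, while the MoR case writes only a small delete entry and shifts the filtering work to `mor_read`. This matches the read and write trade-off described in the list above.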
Replication is supported only for Apache Iceberg table format version 2. For DGx instances that use version 1, replication is disabled and all replication-related features in the Data Gate UI are unavailable.
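If you are unsure which table format version an existing table uses, the `format-version` key in the Iceberg table metadata file records it. The following sketch maps that value to the behavior described in this topic. The function name and the returned keys are hypothetical, and the mapping to a write strategy and to replication support reflects the Data Gate behavior described here, not information that Iceberg itself stores in the metadata.

```python
# Write strategy per table format version, as described in this topic.
WRITE_STRATEGY = {1: "copy-on-write", 2: "merge-on-read"}


def describe_format(metadata: dict) -> dict:
    """Summarize the table format version found in parsed table metadata."""
    version = metadata["format-version"]
    if version not in WRITE_STRATEGY:
        raise ValueError(f"unexpected format-version: {version}")
    return {
        "format_version": version,
        "write_strategy": WRITE_STRATEGY[version],
        # Replication is available only for table format version 2.
        "replication_supported": version == 2,
    }


print(describe_format({"format-version": 1}))
print(describe_format({"format-version": 2}))
```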
For more details about Apache Iceberg table format versioning, see the Apache Iceberg specification.