Differences between Iceberg and Hive Datalake tables

There are several differences between Iceberg and Hive Datalake tables. For example, unlike Hive Datalake tables, Iceberg tables fully manage their data, even when it’s stored on external storage.

The following list describes some of the key differences:

  • Unlike Iceberg tables, Hive Datalake tables are not designed to handle concurrent operations. Issues can therefore occur when DDL, INSERT, and SELECT operations run concurrently.
  • It is not possible to add data to an Iceberg table by placing a file in its data directory. The files that make up the content of the table are tracked through separate metadata, so a manually added file is ignored. Likewise, it’s not possible to delete data by manually removing files from the table data directory. Data can be added by Db2, and by other Iceberg data engines such as Spark, through the Iceberg APIs. Unlike with Hive Datalake tables, data added this way is immediately visible to queries against the table.
  • It is not possible to catalog an existing Iceberg table into Db2 by issuing a CREATE DATALAKE TABLE statement that specifies the Iceberg table location in the LOCATION clause. Instead, use the EXT_METASTORE_SYNC procedure to import the table into Db2.
  • The directory structure for Iceberg tables differs from that of Hive Datalake tables. Iceberg tables include sub-paths for the metadata and data files that belong to the table. If, for example, you create a table called T1, the sub-paths for that table are T1/metadata and T1/data. T1/metadata stores the metadata and manifest files. T1/data stores the data files.
  • The only supported file formats for Iceberg Datalake tables are Parquet, ORC, and Avro.
  • CREATE DATALAKE TABLE using the LIKE clause is not supported for Iceberg tables.
  • Partitioning by expression is restricted to Iceberg transform functions. The bucket[N] and void transforms are not supported.
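
The file-tracking and directory-structure points above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class ToyIcebergTable, its methods, and the helper table_subpaths are hypothetical names invented for illustration. They are not part of Db2 or of any Iceberg API, and no real metadata files are read or written. The sketch shows only the principle that a query sees the files recorded in the table metadata, not whatever happens to be in the data directory.

```python
from pathlib import PurePosixPath


def table_subpaths(table_location: str) -> dict:
    """Return the metadata and data sub-paths for a table location (for example, T1)."""
    root = PurePosixPath(table_location)
    return {"metadata": str(root / "metadata"), "data": str(root / "data")}


class ToyIcebergTable:
    """Hypothetical toy model: table content is the tracked file list,
    not the listing of the data directory."""

    def __init__(self, location: str):
        self.paths = table_subpaths(location)
        self.files_in_data_dir = set()  # files physically present under <table>/data
        self.tracked_files = set()      # files recorded in the table metadata

    def append_via_api(self, filename: str) -> None:
        # Stand-in for an engine committing through the Iceberg APIs:
        # the file is written AND recorded in the metadata.
        self.files_in_data_dir.add(filename)
        self.tracked_files.add(filename)

    def copy_file_manually(self, filename: str) -> None:
        # Stand-in for dropping a file into the data directory by hand:
        # the file exists, but the metadata does not know about it.
        self.files_in_data_dir.add(filename)

    def remove_file_manually(self, filename: str) -> None:
        # Stand-in for deleting a file by hand: the metadata still lists it,
        # so this is not a valid way to delete data.
        self.files_in_data_dir.discard(filename)

    def visible_files(self) -> list:
        # A query plans its scan from the tracked file list only.
        return sorted(self.tracked_files)


t1 = ToyIcebergTable("T1")
t1.append_via_api("part-0001.parquet")
t1.copy_file_manually("stray.parquet")

print(t1.paths)            # the T1/metadata and T1/data sub-paths
print(t1.visible_files())  # only the file committed through the API
```

In this model, stray.parquet is present in the data directory but never appears in visible_files(), which mirrors why manually added files are ignored. Similarly, calling remove_file_manually leaves the tracked list unchanged, which mirrors why removing files by hand is not a supported way to delete data.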