Using Datalake tables in Db2 Warehouse

While tables typically reside in a database, a Datalake table resides in a file or group of files on remote storage, outside of the database. These files can be in ORC, PARQUET, AVRO, or TEXTFILE format.

Note: The Datalake table feature is supported only on the Red Hat OpenShift and Kubernetes platforms, with a Db2 configuration that has two or more nodes. In 11.5.9, the feature is supported only on Linux on x86 hardware. Starting in Db2 12.1.1, the Datalake table feature is supported on IBM POWER LE for new deployments only.

For more information about file formats and what formats are supported, see Supported file formats for Datalake tables.

For more information about supported object storage services, see Supported object storage services for Datalake tables.

Introduction to Datalake tables

Datalake table support provides Db2 Warehouse users with the ability to access data that is stored in open data formats, such as PARQUET and ORC, from within the warehouse.

Use a Datalake table to:

  • Access external data by defining a Datalake table in Db2. This Datalake table can then be used with other Db2 tables in complex queries.
  • Export Db2 data to object storage, while retaining the ability to query that data. To export data from the Db2 database into a Datalake table, specify the Datalake table as the target table in one of the following SQL statements:
    • INSERT SQL
    • SELECT INTO SQL
    • CREATE DATALAKE TABLE AS SELECT SQL
  • Import data from a Datalake table into a table in the database. You can perform operations such as casts, joins, and dropping columns to manipulate data during the import. To import data into the database from a Datalake table, use a FROM clause in a SELECT SQL statement as you would for any other table. Both directions are sketched in the example after this list.
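
For example, a minimal sketch of both directions, assuming a Datalake table MYDL.SALES_DL and hypothetical local tables MYSCHEMA.SALES, MYSCHEMA.REGIONS, and MYSCHEMA.SALES_BY_REGION:

    -- Export: use the Datalake table as the target of an INSERT
    INSERT INTO MYDL.SALES_DL
      SELECT ORDER_ID, REGION_ID, AMOUNT
      FROM MYSCHEMA.SALES
      WHERE SALE_DATE >= '2024-01-01';

    -- Import: reference the Datalake table in a FROM clause like any other
    -- table, applying a cast and a join during the load
    INSERT INTO MYSCHEMA.SALES_BY_REGION
      SELECT R.REGION_NAME, CAST(D.AMOUNT AS DECIMAL(12,2))
      FROM MYDL.SALES_DL D
      JOIN MYSCHEMA.REGIONS R ON R.REGION_ID = D.REGION_ID;
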
A Datalake table is one of the following types:
Hive
The Datalake table is not defined as STORED BY ICEBERG and does not support ACID properties. This type of Datalake table is recommended for read-only or append-only tables. These tables are often referred to as ODF (Open Data Format) tables.
Iceberg
The Datalake table supports ACID table properties and is recommended for scenarios that require more flexibility in data manipulation operations. To define an Iceberg table, specify the STORED BY ICEBERG clause. For more information, see CREATE DATALAKE TABLE statement.
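
As a sketch, the distinction comes down to the STORED BY ICEBERG clause. The table and column names here are hypothetical, and clause order and defaults may differ; see the CREATE DATALAKE TABLE statement reference:

    -- Hive (ODF) table: no STORED BY ICEBERG clause; best for read-only
    -- or append-only workloads
    CREATE DATALAKE TABLE MYSCHEMA.EVENTS_HIVE (
      EVENT_ID BIGINT,
      PAYLOAD  VARCHAR(200)
    ) STORED AS PARQUET;

    -- Iceberg table: the STORED BY ICEBERG clause enables ACID properties
    CREATE DATALAKE TABLE MYSCHEMA.EVENTS_ICE (
      EVENT_ID BIGINT,
      PAYLOAD  VARCHAR(200)
    ) STORED BY ICEBERG STORED AS PARQUET;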

Datalake tables

The support for Datalake tables builds on the architecture used in Big SQL to access open data format (ODF) data which is stored in object storage. For more information, see Big SQL architecture for Datalake tables.

The information in this section provides an overview of some of the key aspects of Datalake tables. These aspects apply to both Hive and Iceberg Datalake tables.

Data Definition Language (DDL) processing

A Db2 autonomous procedure (autonomous transaction) is used to handle Datalake table DDL statements, because DDL operations must commit their updates to the Hive Metastore (HMS) and the Db2 catalogs independently of the invoking transaction. DDL statements can be executed only through the EXECUTE IMMEDIATE statement; PREPARE/EXECUTE of DDL is not supported. Unlike other Db2 DDL statements, these statements are not under Db2 transactional control: each statement either succeeds and is committed, or fails and is rolled back.

When a Datalake table is created and the schema for the table does not yet exist in the HMS, the schema is created automatically. For a Db2 DROP SCHEMA statement to succeed, the Hive schema must be dropped first by using the DROP DATALAKE SCHEMA ON DB2 SCHEMA statement. When a Db2 DROP SCHEMA statement is executed, an attempt is also made to drop the schema in the HMS.
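
For example, a minimal sketch of issuing Datalake DDL through EXECUTE IMMEDIATE from a compound statement; the table name and bucket path are hypothetical:

    BEGIN
      EXECUTE IMMEDIATE
        'CREATE DATALAKE TABLE MYSCHEMA.T1 (C1 INT) STORED AS PARQUET
         LOCATION ''s3a://mybucket/warehouse/t1''';
    END

Because the statement is autonomous, it commits (or rolls back) on its own, regardless of what the surrounding unit of work later does.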

For more information about Datalake table DDL handling, see:

SYSHADOOP functions, procedures, and views
Several functions, procedures, and views are provided to facilitate the use of Datalake tables. For more information, see:

SYSHADOOP routines are implemented in Java and use Java heap. Starting in version 12.1.2, the maximum size of this Java heap can be set using the sys_java_heap_sz database manager configuration parameter.
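
For example, a sketch of raising that limit from the command line; the value shown is illustrative only, so consult the sys_java_heap_sz parameter reference for its units and valid range:

    db2 UPDATE DBM CFG USING SYS_JAVA_HEAP_SZ 524288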

Statistics collection

As with Db2 local tables, it is critical to have up-to-date statistics for Datalake tables to improve query performance. The ANALYZE TABLE statement is used to collect statistics on Datalake tables and is similar to the RUNSTATS command used for local Db2 tables. For more information on ANALYZE TABLE handling, see Gathering statistics about Datalake tables.
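
For example, a minimal sketch that gathers table and column statistics on a hypothetical Datalake table; see the ANALYZE TABLE reference for the full syntax and options:

    ANALYZE TABLE MYSCHEMA.SALES_DL COMPUTE STATISTICS FOR ALL COLUMNS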

Support for Row and Column Access Control (RCAC)

Although Row and Column Access Control (RCAC) is supported for Datalake tables, the use of RCAC for Datalake tables provides access control only when users access the Datalake table data using the Db2 database manager. The remote storage used for the Datalake table still requires a separate solution for data security. For more details on how to use RCAC, see Row and column access control (RCAC) overview.
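
As a sketch, a row permission on a hypothetical Datalake table might look like the following. The CREATE PERMISSION syntax is standard Db2 RCAC; whether activation on a Datalake table uses the plain ALTER TABLE form shown here is an assumption to verify against the RCAC documentation:

    -- Allow each user to see only the rows for their own branch
    -- (table and column names are hypothetical)
    CREATE PERMISSION BRANCH_ROWS ON MYSCHEMA.SALES_DL
      FOR ROWS WHERE BRANCH_ID = SESSION_USER
      ENFORCED FOR ALL ACCESS
      ENABLE;

    ALTER TABLE MYSCHEMA.SALES_DL ACTIVATE ROW ACCESS CONTROL;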

Differences between Db2 local tables and Datalake tables
There are some key differences between Db2 local tables and Datalake tables:
  • The data is not owned by Db2, which has some key implications:
    • The schema of the table can change externally.
    • Table data can be changed externally, which might negatively impact performance if statistics are not updated to reflect the changes to the data.
  • Datalake tables are not associated with a Db2 table space. The TBSPACEID column in the SYSCAT.TABLES catalog view shows -12 for Datalake tables (see the query after this list).
  • When altering a Datalake table, only one alteration is possible per statement. Altering multiple elements requires running multiple ALTER DATALAKE TABLE statements.
  • Isolation levels are not applicable when querying Datalake tables.
  • Operations on Datalake tables (both DDL and DML) are outside of Db2 transactional control.
  • Unlike Db2 local table names, Datalake table names must be unique across all tenants, and not just within a tenant.
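
For example, the TBSPACEID marker gives a quick way to list all Datalake tables in the catalog:

    SELECT TABSCHEMA, TABNAME
      FROM SYSCAT.TABLES
      WHERE TBSPACEID = -12;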

For more information regarding Datalake table limitations and restrictions, see Restrictions and limitations on Datalake tables.

Datalake table locations and Datalake schemas

Datalake tables require a location that determines where their data is stored. You can set a location by using a Datalake schema. Using a Datalake schema relieves the table creator of specifying a location and provides a predictable, consistent location for Datalake tables stored in the same Datalake schema. Alternatively, you can set a location by using the LOCATION clause of the CREATE DATALAKE TABLE statement.

A Datalake schema is a container for Datalake tables that can optionally have a LOCATION.
  • Datalake tables created in that schema without an explicit location are stored in a subdirectory of the schema location. The subdirectory name is the name of the table.
  • Datalake tables created in that schema with an explicit location are stored in the specified location.
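
For example, a sketch of the two approaches; the schema name, table names, and bucket path are hypothetical:

    -- Implicit: no LOCATION clause, so the table is stored in a
    -- subdirectory of the Datalake schema location, named after the table
    CREATE DATALAKE TABLE MYDL.T_IMPLICIT (C1 INT) STORED AS PARQUET;

    -- Explicit: the LOCATION clause overrides the schema location
    CREATE DATALAKE TABLE MYDL.T_EXPLICIT (C1 INT) STORED AS PARQUET
      LOCATION 's3a://mybucket/custom/path/t_explicit';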

Data Manipulation Language (DML) processing

For Hive Datalake tables, only INSERT and SELECT statements are supported. For Iceberg Datalake tables, INSERT, UPDATE, and DELETE operations are supported. Note that DML operations on Datalake tables are not part of Db2 transactional processing: INSERT, UPDATE, and DELETE operations on Datalake tables cannot be rolled back as part of a Db2 logical unit of work.
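
For example, with the hypothetical tables sketched earlier (EVENTS_HIVE is a Hive Datalake table, EVENTS_ICE an Iceberg table):

    -- Hive Datalake tables: INSERT and SELECT only
    INSERT INTO MYSCHEMA.EVENTS_HIVE VALUES (1, 'created');

    -- Iceberg Datalake tables also support UPDATE and DELETE
    UPDATE MYSCHEMA.EVENTS_ICE SET PAYLOAD = 'archived' WHERE EVENT_ID = 1;
    DELETE FROM MYSCHEMA.EVENTS_ICE WHERE EVENT_ID = 1;

    -- Either way, the changes commit immediately; a subsequent ROLLBACK
    -- does not undo them
    ROLLBACK;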

Iceberg Datalake tables

Iceberg provides advantages over other table formats in that it manages individual data files instead of directories. This design allows data files to be written in place; files are added to the table only through an explicit commit, which ensures isolated reads and writes. Data files are immutable after they are written. Deleted and updated rows are written to separate data files.

Changes made to an Iceberg table result in the creation of a new metadata file (a new snapshot) that references the current schema, partition definitions, properties, and data files. This design ensures that at any point in time, a table's schema and exact set of data files are known, which enables features such as schema evolution, time travel and rollback, and data compaction. These features make a compelling case for using the table format.

The Db2 Warehouse SaaS release adds support for creating and dropping Iceberg tables, including create table as select (CTAS); for adding externally created Iceberg tables to Db2; and for read and insert operations. With CTAS, you can create and populate an Iceberg table with data from existing Db2 or Datalake tables.
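
For example, a minimal CTAS sketch that creates and populates an Iceberg table from an existing table; the names are hypothetical, and clause placement may differ from this sketch, so see the CREATE DATALAKE TABLE statement reference:

    CREATE DATALAKE TABLE MYSCHEMA.SALES_ICE
      STORED BY ICEBERG STORED AS PARQUET
      AS (SELECT ORDER_ID, AMOUNT FROM MYSCHEMA.SALES);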

Differences between Iceberg and Hive Datalake tables

There are several differences between Iceberg and Hive Datalake tables. For example, unlike Hive Datalake tables, Iceberg tables fully manage their data, even when it is stored on external storage.

The following are some of the key differences:
  • It is not possible to add more data to an Iceberg table by adding a file to its data directory. The list of files that makes up the content of the table is tracked through separate metadata, so manually added files are ignored. Likewise, it is not possible to delete data by manually removing files from the table data directory. Data can be added to the data directory by Db2, and by other Iceberg data engines such as Spark, through the Iceberg APIs. Unlike with Hive Datalake tables, this data is immediately visible to queries against the table.
  • It is not possible to catalog an existing Iceberg table into Db2 by issuing a CREATE DATALAKE TABLE statement that indicates the Iceberg table location with the LOCATION clause. Use the EXT_METASTORE_SYNC stored procedure to import the table into Db2 (see the example after this list).
  • The directory structure for Iceberg tables differs from Hive Datalake tables. Iceberg tables include sub-paths for the metadata and data files that belong to the table. If, for example, you create a table called T1, the sub-paths for that table are T1/metadata and T1/data. T1/metadata stores the metadata and manifest files; T1/data stores the data files.
  • The only supported file formats are PARQUET, ORC, and AVRO.
  • CREATE DATALAKE TABLE using the LIKE clause is not supported.
  • Partitioning by expression is restricted to Iceberg transform functions, except bucket[N] and void, which are not supported.
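
For example, a sketch of importing an externally created Iceberg table, as noted in the list above. The argument list shown here is an assumption, so check the EXT_METASTORE_SYNC stored procedure reference for the actual signature:

    -- Hypothetical arguments: the Db2 schema and the table to import
    CALL SYSHADOOP.EXT_METASTORE_SYNC('MYSCHEMA', 'EVENTS_ICE');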

Resources