Using Datalake tables

While tables typically reside in a database, a Datalake table resides outside of the database, in an external file or group of files that can be defined in ORC, PARQUET, or TEXTFILE format.

For more information about file formats and what formats are supported, see Supported file formats for Datalake tables.

Important: For a list of supported object storage services, see this link.

Introduction to Datalake tables

Datalake table support gives Db2 Warehouse on Cloud users the ability to access data stored in open data formats, such as PARQUET and ORC, from within the warehouse.

Use a Datalake table to:

  • Access external data by defining a Datalake table in Db2. This Datalake table can then be used with other Db2 tables in complex queries.
  • Export Db2 data to object storage, while retaining the ability to query that data. To export data from the Db2 database into a Datalake table, specify the Datalake table as the target table in one of the following SQL statements (a combined export/import sketch follows this list):
    • INSERT
    • SELECT INTO
    • CREATE DATALAKE TABLE ... AS SELECT
  • Import data from a Datalake table into a table in the database. You can manipulate the data during import with operations such as casts, joins, and dropping columns. To import data into the database from a Datalake table, use a FROM clause in a SELECT SQL statement as you would for any other table.
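
For illustration, the following sketch shows the export and import patterns described above; all table and column names are hypothetical, and the tables are assumed to already exist:

  -- Export: insert rows from a local Db2 table into an existing Datalake table
  INSERT INTO sales.orders_export
    SELECT * FROM sales.orders
    WHERE order_date >= '2024-01-01';

  -- Import: read Datalake data back into a local table, manipulating it
  -- (here, a cast and an aggregation) on the way in
  INSERT INTO sales.orders_summary
    SELECT region, CAST(SUM(amount) AS DECIMAL(15,2))
    FROM sales.orders_export
    GROUP BY region;
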
A Datalake table is of one of the following types:
Regular
The Datalake table is not defined as STORED BY ICEBERG and does not support ACID properties. This type of Datalake table is recommended for read-only or append-only tables. These tables are often referred to as ODF (Open Data Format) tables.
Iceberg
The Datalake table supports ACID table properties and is recommended for scenarios that require more flexibility in data manipulation operations. To define an Iceberg table, specify the STORED BY ICEBERG clause. For more information, see the CREATE DATALAKE TABLE statement.
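
As a sketch (table, column, and bucket names are hypothetical, and the s3a URI is only a placeholder for a supported object storage location), the two types might be defined as follows:

  -- Regular (ODF) Datalake table stored as PARQUET
  CREATE DATALAKE TABLE sales.orders_odf (
    order_id INT,
    amount   DECIMAL(15,2))
  STORED AS PARQUET
  LOCATION 's3a://my-bucket/sales/orders_odf';

  -- Iceberg Datalake table, which adds ACID properties
  CREATE DATALAKE TABLE sales.orders_ice (
    order_id INT,
    amount   DECIMAL(15,2))
  STORED BY ICEBERG
  LOCATION 's3a://my-bucket/sales/orders_ice';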

Datalake tables

The support for Datalake tables builds on the architecture used in Big SQL to access open data format (ODF) data which is stored in object storage. For more information, see Big SQL architecture for Datalake tables.

The information in this section provides an overview of some of the key aspects of Datalake tables. These are applicable to both regular Datalake and Iceberg tables.

Data Definition Language (DDL) processing

A Db2 autonomous procedure (autonomous transaction) is used to handle Datalake table DDL statements, because DDL operations must update the Hive Metastore (HMS) and the Db2 catalogs autonomously. DDL statements can be executed only through the EXECUTE IMMEDIATE statement; there is no support for PREPARE / EXECUTE of DDL, and these statements can be executed only on the Db2 catalog node. Unlike other Db2 DDL statements, these statements are not within Db2 transactional control: each statement either succeeds and is committed, or fails and is rolled back.

When a Datalake table is created, if the schema for the table does not yet exist in the HMS, the schema is created automatically. When a Db2 DROP SCHEMA statement is executed, an attempt is made to drop the schema in the HMS as well; there is no Datalake-specific DROP SCHEMA statement.
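
As a minimal sketch (the table name and storage location are hypothetical), a Datalake DDL statement might be issued from an SQL PL compound block through EXECUTE IMMEDIATE:

  -- Must run on the Db2 catalog node; the DDL is not under Db2 transactional control
  BEGIN
    EXECUTE IMMEDIATE
      'CREATE DATALAKE TABLE sales.orders_odf (
         order_id INT,
         amount   DECIMAL(15,2))
       STORED AS PARQUET
       LOCATION ''s3a://my-bucket/sales/orders_odf''';
  END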

For more information about Datalake table DDL handling, see:

SYSHADOOP functions, procedures, and views

Several functions, procedures, and views are provided to facilitate the use of Datalake tables. For more information, see:

Statistics collection

As with Db2 local tables, it is critical to have up-to-date statistics for Datalake tables to improve query performance. The ANALYZE TABLE statement is used to collect statistics on Datalake tables and is similar to the RUNSTATS command used for local Db2 tables. For more information on ANALYZE TABLE handling, see Gathering statistics about Datalake tables.
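
As a sketch (the table and column names are hypothetical), collecting statistics might look like the following, using the ANALYZE TABLE form inherited from Big SQL:

  -- Gather table-level and column-level statistics for the optimizer
  ANALYZE TABLE sales.orders_odf
    COMPUTE STATISTICS FOR COLUMNS order_id, amount;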

Support for Row and Column Access Control (RCAC)

Although Row and Column Access Control (RCAC) is supported for Datalake tables, the use of RCAC for Datalake tables provides access control only when users access the Datalake table data using the Db2 database manager. The remote storage used for the Datalake table still requires a separate solution for data security. For more details on how to use RCAC, see Row and column access control (RCAC) overview.
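
As an illustrative sketch, the statements below follow the standard Db2 RCAC pattern; the table, column, and role names are hypothetical, and the linked RCAC overview describes the exact forms that apply to Datalake tables:

  -- Allow members of the SALES_EU role to see only European rows
  CREATE PERMISSION sales.eu_rows ON sales.orders_odf
    FOR ROWS WHERE VERIFY_ROLE_FOR_USER(SESSION_USER, 'SALES_EU') = 1
               AND region = 'EU'
    ENFORCED FOR ALL ACCESS
    ENABLE;

  ALTER TABLE sales.orders_odf ACTIVATE ROW ACCESS CONTROL;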

Differences between Db2 local tables and Datalake tables
There are some key differences between Db2 local tables and Datalake tables:
  • The data is not owned by Db2, which has some key implications:
    • The schema of the table can change externally.
    • Table data can be changed externally, which might negatively impact performance if statistics are not updated to reflect the changes to the data.
  • Datalake tables are not associated with a Db2 table space. The TBSPACEID column in the SYSCAT.TABLES catalog view contains -12 for Datalake tables.
  • When altering a Datalake table, only one alteration is possible per statement. Altering multiple elements requires running multiple ALTER DATALAKE TABLE statements; see the sketch after this list.
  • Isolation levels are not applicable when querying Datalake tables.
  • Operations on Datalake tables (both DDL and DML) are outside of Db2 transactional control.
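
For instance, adding two columns to a hypothetical table requires two separate statements (a sketch; the ADD COLUMN form is assumed from the usual ALTER syntax):

  -- Each ALTER DATALAKE TABLE statement can make only one change
  ALTER DATALAKE TABLE sales.orders_odf ADD COLUMN discount DECIMAL(5,2);
  ALTER DATALAKE TABLE sales.orders_odf ADD COLUMN promo_code VARCHAR(10);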

For more information regarding Datalake table limitations and restrictions, see Restrictions and limitations on Datalake tables.

Iceberg Datalake tables

Db2 Warehouse on Cloud supports the Apache Iceberg table format. Apache Iceberg is an open-source table format designed to handle large, terabyte- and petabyte-scale datasets, which makes it well suited to large-scale data analytics. It provides standard SQL support, making it easy to create and manage tables and to process the data stored in them.

Iceberg provides advantages over other table formats in that it manages individual data files instead of directories. This design allows data files to be written in place and added to the table only through an explicit commit, which ensures isolated reads and writes. Data files are immutable after they are written; deleted and updated rows are written to separate data files.

Changes made to an Iceberg table result in the creation of a new metadata file (a new snapshot) that references the current schema, partition definitions, properties, and data files. This design ensures that, at any point in time, a table's schema and exact set of data files are known, which enables features such as schema evolution, time travel and rollback, and data compaction. These features make a compelling case for using the table format.

This release of Db2 Warehouse on Cloud adds support for creating and dropping Iceberg tables, including CREATE TABLE AS SELECT (CTAS), for adding externally created Iceberg tables to Db2, and for reading from and inserting into Iceberg tables. With CTAS, you can easily create and populate an Iceberg table with data from existing Db2 or Datalake tables.
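
As a sketch (the table names and location are hypothetical, and the clause ordering is an assumption), a CTAS statement for an Iceberg table might look like this:

  -- Create and populate an Iceberg table from an existing table in one statement
  CREATE DATALAKE TABLE sales.orders_ice
    STORED BY ICEBERG
    LOCATION 's3a://my-bucket/sales/orders_ice'
    AS SELECT order_id, amount, region FROM sales.orders;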

Differences between Iceberg and regular Datalake tables

In contrast to regular Datalake tables, Iceberg tables fully manage their data, even when it is stored on external storage. This has several implications that users must be aware of:
  • It is not possible to add more data to an Iceberg table by adding a file to its data directory. The list of files that make up the content of the table is tracked through separate metadata, so manually added files are ignored. Likewise, it is not possible to delete data by manually removing files from the table data directory. Data can be added to the data directory by Db2, as well as by other Iceberg data engines such as Spark, by using the Iceberg APIs. Unlike with regular Datalake tables, this data is immediately visible to queries against the table.
  • It is not possible to catalog an existing Iceberg table into Db2 by issuing a CREATE DATALAKE TABLE statement and indicating the Iceberg table location with the table LOCATION clause. Use the EXT_METASTORE_SYNC procedure to import the table into Db2.
  • The directory structure for Iceberg tables differs from that of regular Datalake tables. Iceberg tables include sub-paths for the metadata and data files that belong to the table. If, for example, you create a table called T1, the sub-paths for that table are T1/metadata and T1/data. T1/metadata stores the metadata and manifest files; T1/data stores the data files.
  • The only supported file formats are Parquet, ORC, and Avro.
  • CREATE DATALAKE TABLE using the LIKE clause is not supported.
  • Partitioning by expression is restricted to the Iceberg transform functions, except bucket[N] and void, which are not supported.
  • Partitioning by expression does not require you to create a new column as part of the partition expression clause; see the sketch after this list.
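
For illustration (the table and column names are hypothetical, and the PARTITIONED BY form shown is an assumed rendering of the partition expression clause), an Iceberg table partitioned with the month transform might look like this:

  -- month(order_date) is an Iceberg transform; no extra partition column is defined
  CREATE DATALAKE TABLE sales.orders_by_month (
    order_id   INT,
    order_date DATE,
    amount     DECIMAL(15,2))
  PARTITIONED BY (month(order_date))
  STORED BY ICEBERG
  LOCATION 's3a://my-bucket/sales/orders_by_month';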

Resources