Using Datalake tables in Db2 Warehouse
While tables typically reside in a database, a Datalake table resides in a file or group of files, in ORC, PARQUET, AVRO, or TEXTFILE format, on remote storage outside of the database.
For more information about file formats and what formats are supported, see Supported file formats for Datalake tables.
For more information about supported object storage services, see Supported object storage services for Datalake tables.
Introduction to Datalake tables
Datalake table support gives Db2 Warehouse users the ability to access data that is stored in open data formats, such as PARQUET and ORC, from within the warehouse.
Use a Datalake table to:
- Access external data by defining a Datalake table in Db2. This Datalake table can then be used with other Db2 tables in complex queries.
- Export Db2 data to object storage, while retaining the ability to query that data. To export data from the Db2 database into a Datalake table, specify the Datalake table as the target table in one of the following SQL statements:
- INSERT
- SELECT INTO
- CREATE DATALAKE TABLE ... AS SELECT
- Import data from a Datalake table into a table in the database. You can perform operations such as casts, joins, and dropping columns to manipulate the data during the import. To import data into the database from a Datalake table, use a FROM clause in a SELECT SQL statement as you would for any other table. A combined sketch of these patterns follows this list.
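The following sketch shows the three patterns together. All names (sales_ext, myschema.sales_local, myschema.sales_summary, and the storage alias myalias) are hypothetical, and the LOCATION URL assumes a cataloged storage alias; adjust it to match your object storage configuration.

   -- Define a Datalake table over PARQUET files in object storage.
   CREATE DATALAKE TABLE sales_ext (
      order_id   INTEGER,
      cust_id    INTEGER,
      amount     DECIMAL(10,2),
      order_date DATE)
   STORED AS PARQUET
   LOCATION 'DB2REMOTE://myalias//mybucket/sales/';

   -- Export: make the Datalake table the target of an INSERT.
   INSERT INTO sales_ext
      SELECT order_id, cust_id, amount, order_date
      FROM myschema.sales_local;

   -- Import: read from the Datalake table in a FROM clause,
   -- applying casts (and joins, if needed) along the way.
   INSERT INTO myschema.sales_summary (order_id, amount_int)
      SELECT order_id, CAST(amount AS INTEGER)
      FROM sales_ext;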
There are two types of Datalake tables:
- Hive
- The Datalake table is not defined as STORED BY ICEBERG and does not support ACID properties. This type of Datalake table is recommended for read-only or append-only workloads. These tables are often referred to as Open Data Format (ODF) tables.
- Iceberg
- The Datalake table supports ACID table properties and is recommended for scenarios that require more flexibility in data manipulation operations. To define an Iceberg table, specify the STORED BY ICEBERG clause. For more information, see the CREATE DATALAKE TABLE statement.
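For example, a minimal sketch of an Iceberg table definition, reusing the hypothetical storage alias from the earlier example; see the CREATE DATALAKE TABLE statement for the full syntax:

   -- STORED BY ICEBERG enables ACID properties, so UPDATE and DELETE
   -- are available in addition to INSERT and SELECT.
   CREATE DATALAKE TABLE events_ice (
      event_id BIGINT,
      payload  VARCHAR(2000),
      ts       TIMESTAMP)
   STORED BY ICEBERG
   LOCATION 'DB2REMOTE://myalias//mybucket/events/';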
Datalake tables
The support for Datalake tables builds on the architecture used in Big SQL to access open data format (ODF) data that is stored in object storage. For more information, see Big SQL architecture for Datalake tables.
The information in this section provides an overview of some key aspects of Datalake tables. These aspects apply to both Hive and Iceberg Datalake tables.
- Data Definition Language (DDL) processing
A Db2 autonomous procedure (autonomous transaction) handles Datalake table DDL statements, because DDL operations must be autonomous with updates to both the Hive Metastore (HMS) and the Db2 catalogs. DDL statements can only be executed via the EXECUTE IMMEDIATE statement; there is no support for PREPARE / EXECUTE of DDL. Unlike other Db2 DDL statements, these statements are not within Db2 transactional control: each statement either succeeds and is committed, or fails and is rolled back.
When a Datalake table is created and the schema for the table does not yet exist in the HMS, the schema is automatically created. When a Db2 DROP SCHEMA statement is executed, an attempt is made to drop the schema in the HMS as well. There is no Datalake-specific DROP SCHEMA statement.
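For example, a sketch of a Datalake DDL statement submitted through EXECUTE IMMEDIATE in a compound statement (the table name and location are hypothetical):

   -- Run with a non-semicolon statement terminator (for example, @)
   -- when using the command line processor.
   BEGIN
      EXECUTE IMMEDIATE
         'CREATE DATALAKE TABLE staging_ext (id INTEGER, v VARCHAR(100))
          STORED AS PARQUET
          LOCATION ''DB2REMOTE://myalias//mybucket/staging/''';
   END@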
For more information about Datalake table DDL handling, see the related reference topics.
- SYSHADOOP functions, procedures, and views
Several functions, procedures, and views are provided to facilitate the use of Datalake tables; see the related reference topics for the complete list.
SYSHADOOP routines are implemented in Java and use the Java heap. Starting in version 12.1.2, the maximum size of this Java heap can be set by using the sys_java_heap_sz database manager configuration parameter.
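For example, from the Db2 command line (the value 4096 is an arbitrary placeholder; check the sys_java_heap_sz documentation for its units and valid range):

   db2 UPDATE DBM CFG USING sys_java_heap_sz 4096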
- Statistics collection
As with Db2 local tables, it is critical to have up-to-date statistics for Datalake tables to improve query performance. The ANALYZE TABLE statement is used to collect statistics on Datalake tables and is similar to the RUNSTATS command used for local Db2 tables. For more information on ANALYZE TABLE handling, see Gathering statistics about Datalake tables.
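For example, a sketch with hypothetical names; the FOR ALL COLUMNS clause follows the Big SQL form of the statement:

   -- Collect table-level and column-level statistics so the optimizer
   -- has current information about the external data.
   ANALYZE TABLE myschema.sales_ext COMPUTE STATISTICS FOR ALL COLUMNS;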
- Support for Row and Column Access Control (RCAC)
Although Row and Column Access Control (RCAC) is supported for Datalake tables, RCAC provides access control only when users access the Datalake table data through the Db2 database manager. The remote storage used for the Datalake table still requires a separate solution for data security. For more details on how to use RCAC, see Row and column access control (RCAC) overview.
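For example, a sketch using standard Db2 RCAC syntax against a hypothetical Datalake table orders_ext with a region column; the permission restricts only access through Db2, not direct reads of the files in object storage:

   -- Expose EU rows only to users holding the (hypothetical)
   -- EU_ANALYST role.
   CREATE PERMISSION eu_rows ON myschema.orders_ext
      FOR ROWS WHERE region = 'EU'
            AND VERIFY_ROLE_FOR_USER(SESSION_USER, 'EU_ANALYST') = 1
      ENFORCED FOR ALL ACCESS
      ENABLE;

   -- Row access control must then be activated on the table.
   ALTER TABLE myschema.orders_ext ACTIVATE ROW ACCESS CONTROL;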
- Differences between Db2 local tables and Datalake tables
There are some key differences between Db2 local tables and Datalake tables:
- The data is not owned by Db2, which has some key implications:
- The schema of the table can change externally.
- Table data can be changed externally, which might negatively impact performance if statistics are not updated to reflect the changes to the data.
- Datalake tables are not associated with a Db2 table space. The TBSPACEID column in the SYSCAT.TABLES catalog view shows -12 for Datalake tables (see the query sketch after this list).
- When altering a Datalake table, only one alteration is possible per statement. Altering multiple elements requires running multiple ALTER DATALAKE TABLE statements.
- Isolation levels are not applicable when querying Datalake tables.
- Operations on Datalake tables (both DDL and DML) are outside of Db2 transactional control.
- Unlike Db2 local table names, Datalake table names must be unique across all tenants, not just within a tenant.
For more information regarding Datalake table limitations and restrictions, see Restrictions and limitations on Datalake tables.
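As an illustration of the table space difference noted in the list above, the following catalog query lists Datalake tables by their sentinel table space ID:

   -- Datalake tables have no Db2 table space; TBSPACEID is -12.
   SELECT TABSCHEMA, TABNAME, TBSPACEID
   FROM SYSCAT.TABLES
   WHERE TBSPACEID = -12;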
- Data Manipulation Language (DML) processing
For Hive Datalake tables, only INSERT and SELECT statements are supported. For Iceberg Datalake tables, INSERT, UPDATE, and DELETE operations are supported. Note that DML operations on Datalake tables are not part of Db2 transactional processing: INSERT, UPDATE, and DELETE operations on Datalake tables cannot be rolled back as part of a Db2 logical unit of work.
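A brief sketch against the hypothetical tables from the earlier examples; none of these statements can be rolled back as part of a Db2 unit of work:

   -- Hive Datalake table: INSERT and SELECT only.
   INSERT INTO sales_ext VALUES (1001, 7, 19.99, CURRENT DATE);

   -- Iceberg Datalake table: UPDATE and DELETE are also supported.
   UPDATE events_ice SET payload = 'redacted' WHERE event_id = 42;
   DELETE FROM events_ice WHERE ts < CURRENT TIMESTAMP - 90 DAYS;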