Transaction handling for Datalake tables
Operations on Datalake tables are outside of Db2 transactional control. This includes both DDL operations and INSERT statements.
These operations are not transactional: they either succeed or fail, and they cannot be rolled back. As a result, errors can occur when running concurrent DDL (CREATE/DROP/ALTER), DCL (GRANT/REVOKE), INSERT, and SELECT statements on Hive Datalake tables. As described later, the Big SQL I/O engine attempts to mitigate issues of this nature, but errors can still occur from open source components such as Hive.
It is also important to know that the Hive Metastore (HMS) used for Datalake tables is stored in the Db2 database, which affects transactional processing as well.
Data Definition Language (DDL) transaction handling
If a DDL statement on a Datalake table is successful, it is automatically committed. If the DDL statement fails, it is automatically rolled back.
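For example, the following sketch (using a hypothetical Datalake table named mydatalaketab) shows that a successful DDL statement is committed immediately and cannot be undone by a later ROLLBACK:
drop table mydatalaketab;
-- The DROP is automatically committed as soon as it succeeds.
rollback;
-- The ROLLBACK has no effect; the table remains dropped.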
When you run a CREATE DATALAKE TABLE statement that contains an AS full-select clause (a CTAS statement), the statement is processed as two separate statements: a CREATE DATALAKE TABLE and an INSERT with SELECT. The two statements run in sequence, independently of each other. If the CREATE DATALAKE TABLE succeeds but the INSERT with SELECT fails, the table exists but contains no data. In that case, do one of the following (a sketch follows this list):
- Run the INSERT with SELECT statement on the table.
- Drop the table and run the CREATE TABLE AS statement again.
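For example, a CTAS statement and the two recovery options might look like the following sketch. The table and column names (sales_dl, sales_src) are hypothetical, and the exact CTAS clauses may vary in your release:
create datalake table sales_dl as (select order_id, amount from sales_src) with data;
-- If the CREATE DATALAKE TABLE succeeds but the INSERT with SELECT portion fails,
-- either load the data into the existing (empty) table:
insert into sales_dl select order_id, amount from sales_src;
-- or drop the table and run the CTAS statement again:
drop table sales_dl;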
Data Manipulation Language (DML) transaction handling
INSERT into Datalake tables is also not a Db2 transactional operation.
- When an INSERT into a Datalake table successfully completes, files that contain the inserted rows are created permanently in object storage.
- Db2 COMMIT and ROLLBACK statements have no effect on a completed Datalake INSERT statement.
Rows inserted by an INSERT statement are visible to SELECT statements only after the entire INSERT statement completes successfully. An SQL exclusive lock is held on the table during the last phase of the INSERT, when files are moved to the file path that represents the table in object storage.
Each invocation of an INSERT statement produces a data file in object storage, which can degrade both INSERT and SELECT performance. Do not use INSERT ... VALUES to add large quantities of data to a Datalake table, because this operation cannot be parallelized. Use INSERT INTO ... VALUES only for testing or for populating very small tables in which all of the rows can be inserted in a single operation.
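The following sketch (with hypothetical tables sales_dl and staging_src) contrasts the recommended bulk INSERT with SELECT against a small INSERT ... VALUES, and shows that ROLLBACK does not undo a completed Datalake INSERT:
-- Preferred for large volumes: a single bulk insert that can be parallelized.
insert into sales_dl select order_id, amount from staging_src;
rollback;
-- The ROLLBACK has no effect; the files written by the INSERT remain in object storage.
-- Acceptable only for testing or very small tables: one multi-row VALUES insert.
insert into sales_dl values (1, 10.00), (2, 20.00);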
Hive Metastore (HMS) transaction handling
Handling of Datalake tables requires access to both the Db2 catalog and the Hive Metastore (HMS). Because Datalake table handling is outside of Db2 transactional control, it is recommended that Db2 DDL and DCL work be committed before you run operations against Datalake tables. For example, in the following sequence, the role and authority changes should be committed before the INSERT against the Datalake table runs:
create role role1;
grant role1 to user user2;
grant dbadm on database to user user3;
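-- Recommended: commit the preceding DDL and DCL statements before operating
-- on the Datalake table (illustrative placement of the COMMIT).
commit;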
insert into mydatalaketable select ... ;