CREATE DATALAKE TABLE statement

The CREATE DATALAKE TABLE statement defines a Db2® table that is based on a Hive table for the Datalake environment. While tables typically reside in a database, a Datalake table resides outside of a database, in an external file or group of files that can be defined in ORC, PARQUET, or TEXTFILE format.

Use a Datalake table to:
  • Store data outside the database while retaining the ability to query that data. To export data from the Db2 database into a Datalake table, specify the Datalake table as the target table in one of the following SQL statements:
    • INSERT SQL
    • SELECT INTO SQL
    • CREATE DATALAKE TABLE AS SELECT SQL
  • Import data from a Datalake table into a table in the database. You can perform operations such as casts, joins, and dropping columns to manipulate data during importing. To import data into the database from a Datalake table, use a FROM clause in a SELECT SQL statement as you would for any other table.
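For example, the export and import operations described above might look like the following sketch (the table and column names are hypothetical):

```sql
-- Hypothetical names; assumes a Datalake table SALES_EXPORT already exists.
-- Export: use the Datalake table as the target of an INSERT.
INSERT INTO sales_export
  SELECT order_id, customer_id, amount
    FROM db2schema.sales
    WHERE order_year = 2023;

-- Import: reference the Datalake table in a FROM clause, with a cast and a join.
SELECT s.order_id, CAST(s.amount AS DECIMAL(12,2)) AS amount, c.name
  FROM sales_export s
  JOIN db2schema.customers c ON c.id = s.customer_id;
```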
A Datalake table is of one of the following types:
Hive
A Hive Datalake table does not support ACID properties or UPDATE and DELETE operations. This type of Datalake table is recommended for read-only or append-only workloads.
Iceberg

Iceberg tables support ACID table properties and are recommended for scenarios that require more flexibility in data manipulation operations. At this time, Db2 does not support all of the capabilities of Apache Iceberg. For more information about Iceberg tables, see Using Datalake tables.

To create an Iceberg Datalake table, use the STORED BY ICEBERG clause.

The definition must include its name and the names and attributes of its columns. The definition can include other attributes of the table, such as its primary key or check constraints, although the constraints are informational only. For more information, see Informational constraints.

Invocation

This statement can be executed only as a dynamic statement by using EXECUTE IMMEDIATE. It can be dynamically prepared only if DYNAMICRULES run behavior is in effect for the package (SQLSTATE 42509).

Authorization

The authorization ID of the statement must have either DBADM authority, or CREATETAB authority in combination with the following additional authorization:
  • One of the following privileges or authorities:
    • SYSADM authority
    • SYSCTRL authority
  • Plus one of these privileges or authorities:
    • CREATEIN privilege on the schema, if the schema name of the table refers to an existing schema
    • SCHEMAADM authority on the schema, if the schema name of the table refers to an existing schema
To define a foreign key, the authorization ID of the statement must have one of the following privileges for the parent table:
  • REFERENCES privilege on the table
  • REFERENCES privilege on each column of the specified parent key
  • CONTROL privilege on the table
  • SCHEMAADM authority on the schema, if the schema name of the parent table refers to an existing schema
  • DBADM authority
To use the LOCATION clause, the authorization ID of the statement must have one of the following privileges:
  • DATAACCESS authority
  • The following privileges on the files and the container (bucket):
    • READ and WRITE privileges on the container and all files in the file path, including files in any file paths that might be recursively specified in the LOCATION clause
    • EXECUTE privileges on the container and all the file paths in the specified LOCATION clause

Syntax

In the following syntax fragments, square brackets [ ] enclose optional clauses, a vertical bar | separates alternatives, and an ellipsis (...) indicates that the preceding item can be repeated with comma separators. Numbers in parentheses refer to the notes that follow.

CREATE DATALAKE TABLE [IF NOT EXISTS] table-name
    { element-list | LIKE (1) table-name (2) }
    [COMMENT string-constant]
    [PARTITIONED BY partitioning-element-list]
    [ROW FORMAT { delimited-row-format | serde-row-format }]
    [STORED AS file-format]
    [STORED BY ICEBERG]
    [TBLPROPERTIES ( 'string-constant1' = 'string-constant2', ... )]
    LOCATION (3) string-constant
    [as-result-table]
element-list
    ( { column-definition | column-name (4) | unique-constraint | referential-constraint | check-constraint }, ... )
column-definition
    column-name data-type column-options
data-type
    primitive-type
primitive-type (5)
    TINYINT | SMALLINT | INT2 | INTEGER | INT | INT4 | BIGINT | INT8
    | FLOAT [( integer )] (the default is FLOAT(53)) | FLOAT8 | REAL | FLOAT4 | DOUBLE [PRECISION]
    | CHARACTER [( integer )] | CHAR [( integer )] | STRING (6)
    | VARCHAR ( integer ) | CHARACTER VARYING ( integer ) | CHAR VARYING ( integer )
    | TIMESTAMP [( integer )] | DATETIME [( integer )] | DATE
    | DECIMAL | DEC | NUMERIC | NUM [( precision-integer [, scale-integer] )] (the default is (10,0))
    | BINARY [( integer )] | VARBINARY [( integer )] | BOOLEAN
column-options
    [COMMENT column-comment] [NULL | NOT NULL] [CONSTRAINT constraint-name constraint-choices constraint-attributes]
constraint-choices
    PRIMARY KEY | UNIQUE | references-clause | CHECK ( check-condition )
references-clause
    REFERENCES table-name [( column-name, ... )]
check-condition
    search-condition | functional-dependency
functional-dependency
    { column-name | ( column-name, ... ) } DETERMINED BY { column-name | ( column-name, ... ) }
constraint-attributes
    [NOT ENFORCED [TRUSTED | NOT TRUSTED] | ENFORCED] [ENABLE QUERY OPTIMIZATION | DISABLE QUERY OPTIMIZATION]
unique-constraint
    [CONSTRAINT constraint-name] { UNIQUE | PRIMARY KEY } ( column-name, ... ) constraint-attributes
referential-constraint
    [CONSTRAINT constraint-name] FOREIGN KEY ( column-name, ... ) references-clause constraint-attributes
check-constraint
    [CONSTRAINT constraint-name] CHECK ( check-condition )
partitioning-element-list
    ( { column-definition | column-name | partition-expression }, ... )
partition-expression
    expression [AS identifier] (7)
delimited-row-format
    DELIMITED [FIELDS TERMINATED BY terminator-char [ESCAPED BY esc-char]] [LINES TERMINATED BY terminator-char] [NULL DEFINED AS null-value]
serde-row-format
    SERDE serde-class [WITH SERDEPROPERTIES ( 'string-constant1' = 'string-constant2', ... )]
file-format
    ORC | PARQUET | PARQUETFILE | TEXTFILE | custom-file-format
custom-file-format
    INPUTFORMAT class-name OUTPUTFORMAT class-name
as-result-table
    AS ( fullselect ) ctas-options (8)
ctas-options
    WITH NO DATA
Notes:
  • 1 LIKE is not supported for ICEBERG tables (STORED BY ICEBERG).
  • 2 Each clause can be used only once.
  • 3 The LOCATION clause is mandatory.
  • 4 You do not need to specify values for column-name when you use as-result-table because the column details can be inferred from the query.
  • 5 The CREATE DATALAKE TABLE statement supports the data types that are described in Data types supported by DATALAKE tables and Data types. However, some data types are syntactically supported for compatibility with Hive but are treated as different data types within the Db2 runtime environment. For more information, see the mapping of data types that are used in the CREATE TABLE statement.
  • 6 Because the STRING data type has no specific size associated with it, the SQL processor might assume that each value in a STRING column always contains 32,672 bytes of data, and performance might be impacted. If performance is a concern, use data types that have a specific length.
  • 7 For Iceberg tables, the AS identifier clause is not supported. For non-Iceberg tables, it is required.
  • 8 When using CTAS, an error can occur if the select statement references a data type that is not supported by the target table. For example, Iceberg does not support a TIMESTAMP with a precision greater than 6. In such cases, cast the value in the select statement to a data type that the target table supports.

Description

DATALAKE
This keyword is required when creating a Datalake table. You can verify that you have defined a Datalake table by querying the PROPERTY column in the SYSCAT.TABLES catalog view. For example:
SELECT SUBSTRING(PROPERTY,22,1) FROM SYSCAT.TABLES WHERE TABNAME='<table-name>';
IF NOT EXISTS
Checks whether the specified table name already exists. If the name exists, no error is returned.
table-name
Specifies a unique name for the new table. The table name must be a valid identifier.

You can use the RENAME DATALAKE TABLE statement to rename a Datalake table. When you rename a Datalake table, the name in both the local database catalog and the Hive catalog is updated.
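As a sketch, a rename might look like the following (the table names are hypothetical; this assumes RENAME DATALAKE TABLE uses the same old-name TO new-name form as the Db2 RENAME TABLE statement):

```sql
-- Hypothetical names; see the RENAME DATALAKE TABLE documentation
-- for the exact syntax in your release.
RENAME DATALAKE TABLE sales_2023 TO sales_archive_2023;
```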

element-list
Defines the elements of a table, including the definition of columns and constraints on the table.
column-definition
Defines the attributes of the columns for a new Datalake table that is not derived from another table by using the AS clause.
column-name
Names a column of the table. The name cannot be qualified, and the same name cannot be used for more than one column of the table. The column name must be a valid identifier.
data-type
Specifies the data type of the column.
primitive-type
There are several built-in data types:
TINYINT
A 1-byte integer with a range of -128 to 127.

Db2 automatically converts the TINYINT to a larger integer type, such as SMALLINT, INT, or BIGINT, or a floating-point type such as FLOAT or DOUBLE.

SMALLINT | INT2
A 2-byte integer with a precision of 5 digits and a range of -32,768 to 32,767.
INTEGER | INT | INT4
A 4-byte integer with a precision of 10 digits and a range of -2,147,483,648 to +2,147,483,647.
BIGINT | INT8
An 8-byte integer with a precision of 19 digits and a range of -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807.
FLOAT(integer) | FLOAT8 | REAL | FLOAT4 | DOUBLE | DOUBLE PRECISION
A single-precision or double-precision floating-point number. If the specified length is in the range:
  • 1 - 24, the number uses single precision.
  • 25 - 53, the number uses double precision.

A single-precision floating-point number is a 32-bit approximation of a real number. The number can be zero or can range from -3.4028234663852886e+38 to -1.1754943508222875e-38, or from 1.1754943508222875e-38 to 3.4028234663852886e+38.

A double-precision floating-point number is a 64-bit approximation of a real number. The number can be zero or can range from -1.7976931348623158e+308 to -2.2250738585072014e-308, or from 2.2250738585072014e-308 to 1.7976931348623158e+308.

A REAL number is a single-precision floating-point number. A DOUBLE number is a double-precision floating-point number.

CHARACTER(integer) | CHAR(integer)
A character string is a sequence of characters. The length of the string (integer) is the number of characters in the sequence. The default is 1.
STRING | VARCHAR(integer) | CHARACTER VARYING(integer) | CHAR VARYING(integer)
The STRING data type has no specific size associated with it. The default is VARCHAR(32672).
TIMESTAMP | DATETIME
A six- or seven-part value (year, month, day, hour, minute, second, and optional fractional seconds) that designates a date and time.

If specified, integer must be between 0 and 9 and represents the precision of fractional seconds from 0 (seconds) to 9 (nanoseconds).

For Iceberg tables, the default precision is 6 and it is also the maximum for Apache Iceberg.

For Hive Datalake tables, the default precision is 9 and it is also the maximum.

Hive does not support specifying a precision on TIMESTAMP, which means that the underlying storage always has a precision of 6 for Iceberg tables and a precision of 9 for non-Iceberg tables.

DATE
A three-part value (year, month, and day) that designates a date. The range of the year part is 0001 to 9999. The range of the month part is 1 to 12. The range of the day part is 1 to x, where x depends on the month.
DECIMAL | DEC | NUMERIC | NUM (precision-integer, scale-integer)
A decimal number with defined precision and scale. The precision is the total number of digits and can range from 1 to 31. The scale is the number of digits to the right of the decimal point and can range from 0 to the precision of the number. The default precision is 10 and scale is 0.
BINARY | VARBINARY (integer)
A varying-length binary value, specified as VARBINARY(length) or BINARY(length), that can be up to 32,672 bytes long. VARBINARY is an alias for BINARY; the two are functionally identical. If no length is provided, the length is assumed to be 32,672.

However, it is recommended that an explicit length be specified. The Db2 engine works most efficiently when the table definition enables a row to fit within 32 KB of memory. When the calculated row size exceeds 32 KB, some queries might see a performance degradation.

BOOLEAN
A Boolean value.
column-options
You can define column options including a comment, the nullability attribute, or constraints.
COMMENT column-comment
A comment that provides additional information about the column and which is stored in the internal Hive metastore.
NULL | NOT NULL
NOT NULL specifies that the column cannot contain null values. Because the Datalake table is external to the local database server, Db2 enforces nullability at query execution time (SQL5104N). The nullability attribute of a column is also enforced during an INSERT operation: a null value cannot be assigned to a column that is defined as NOT NULL (SQLSTATE 23502).
CONSTRAINT constraint-name
Specifies the name of a constraint. Constraints defined on Datalake tables are informational only. For more information, see Informational constraints. You can define the following constraints on a column:
PRIMARY KEY
A primary key must be unique and must have the NOT NULL attribute. A table cannot have more than one primary key.
UNIQUE
A unique key prevents duplicate values in one or more NOT NULL columns of a table. A table can have multiple unique keys.
REFERENCES table-name
Ensures that required table relationships are maintained and that data entry rules are followed. The column on which this parameter is specified becomes the foreign key to the referenced parent table table-name.
CHECK check-condition
Sets restrictions on data that is added to a specific table. This parameter is enforced during INSERT operations or on data that is made available outside of the database manager (by using Hive operations, for example).
search-condition
The search-condition must be true or unknown for every row of the table, and has the following restrictions:
  • A column reference must pertain to a column of the table that is being created.
  • The search condition cannot contain a TYPE predicate.
  • The search condition cannot contain any of the following items (SQLSTATE 42621):
    • Subqueries
    • XMLQUERY or XMLEXISTS expressions
    • Dereference operations or DEREF functions where the scoped reference argument is other than the object identifier (OID) column
    • CAST specifications with a SCOPE clause
    • Column functions
    • Functions that are not deterministic
    • Functions that are defined to have an external action
    • User-defined functions that are defined with either MODIFIES SQL or READS SQL DATA
    • Host variables
    • Parameter markers
    • sequence-references
    • OLAP specifications
    • Special registers and built-in functions that depend on the value of a special register
    • Global variables
    • References to generated columns other than the identity column
    • References to columns of type XML (except in a VALIDATED predicate)
    • An error tolerant nested-table-expression

Check constraints with search-condition are enforced when rows in the table are inserted or updated. A check constraint that is defined on a table automatically applies to all subtables of that table.

functional-dependency
Defines a functional dependency between columns or sets of columns.

The parent set of columns precedes the DETERMINED BY clause, and the child set of columns follows the DETERMINED BY clause. All restrictions on search-condition apply to both parent set and child set columns, and only simple column references are allowed in the column sets (SQLSTATE 42621). The same column must not be identified more than once in a functional dependency (SQLSTATE 42709). No column in the child set of columns can be a nullable column (SQLSTATE 42621).

A functional dependency is not enforced by the database manager during normal operations such as insert, update, or delete. The functional dependency might be used during query rewrite for optimization. Incorrect results might be returned if the integrity of a functional dependency is not maintained.
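For example, a functional dependency can be declared as an informational check constraint. The following sketch uses hypothetical table and column names; the child column (zipcode) is defined as NOT NULL, as required:

```sql
-- Hypothetical names: each ZIPCODE value determines a single CITY value.
-- The dependency is not enforced; it is available for query optimization.
CREATE DATALAKE TABLE addresses (
    id      INT NOT NULL,
    zipcode VARCHAR(10) NOT NULL,
    city    VARCHAR(40),
    CONSTRAINT fd_city CHECK (city DETERMINED BY zipcode) NOT ENFORCED
)
STORED AS PARQUET
LOCATION 'DB2REMOTE://odfdefault//defaultbucket/addresses';
```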

constraint-attributes
Defines attributes that are associated with primary key, unique, referential integrity, or check constraints.
NOT ENFORCED | ENFORCED
Specifies whether the constraint is enforced by the database manager during normal operations such as insert, update, or delete.
NOT ENFORCED
Specifies that the constraint is not enforced by the database manager. This is the default for Datalake tables. A primary key or unique constraint cannot be NOT ENFORCED if there is a dependent ENFORCED referential constraint.
TRUSTED | NOT TRUSTED
Specifies whether the data can be trusted to conform to the constraint.
TRUSTED
Specifies that the data can be trusted to conform to the constraint. This is the default. Use this option only if the table data is independently known to conform to the constraint; otherwise, unpredictable results might occur.
NOT TRUSTED
Specifies that the data cannot be trusted to conform to the constraint. This option is valid only for referential integrity constraints (SQLSTATE 42613). Use this option when the data conforms to the constraint for most rows, but it is not independently known whether all current and future rows will conform. If a constraint is both not trusted and enabled for query optimization, it will not be used for any optimization that depends on the data being in complete conformity with the constraint.
ENFORCED
Although the ENFORCED keyword is available for compatibility with the syntax for local tables that are created with the CREATE TABLE statement, an error is returned when you specify this option for Datalake tables (SQLSTATE 42858).
ENABLE QUERY OPTIMIZATION | DISABLE QUERY OPTIMIZATION
Specifies whether the constraint or functional dependency can be used for query optimization under appropriate circumstances.
ENABLE QUERY OPTIMIZATION
Specifies that the constraint is assumed to be true and can be used for query optimization. This is the default.
DISABLE QUERY OPTIMIZATION
Specifies that the constraint cannot be used for query optimization. This option cannot be specified for primary key and unique constraints (SQLSTATE 42613).
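As a sketch, an informational primary key with its attributes written out explicitly might look like the following (the table, column, and constraint names are hypothetical; NOT ENFORCED and ENABLE QUERY OPTIMIZATION are the defaults and are shown here only for illustration):

```sql
-- Hypothetical names. The primary key is informational only:
-- it is not enforced, but the optimizer can use it.
CREATE DATALAKE TABLE orders_pk (
    orderkey INT NOT NULL,
    status   VARCHAR(1),
    CONSTRAINT pk_orders PRIMARY KEY (orderkey)
        NOT ENFORCED ENABLE QUERY OPTIMIZATION
)
STORED AS PARQUET
LOCATION 'DB2REMOTE://odfdefault//defaultbucket/orders_pk';
```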
column-name
Names a column of the source table when you create a new Datalake (target) table by using the AS clause. If a column name is not specified, column names for the target table are derived from the result set of a query against the source table. You cannot explicitly specify the data types of these columns, because the types are derived from data types in the result set.
unique-constraint
Defines a unique key. This constraint does not allow duplicate values in one or more columns of the table. A table can have multiple unique keys. The columns that are specified in a unique constraint must be defined as NOT NULL.
referential-constraint
Defines a referential constraint.
CONSTRAINT constraint-name
Specifies a name for the referential constraint.
FOREIGN KEY (column-name, ...)
The foreign key of the referential constraint is composed of the identified columns. Each name in the list of column names must identify a column of the table that is being created, and the same column must not be identified more than once. The number of foreign key columns must be the same as the number of parent key columns, and the data types of the corresponding columns must be compatible (SQLSTATE 42830).
check-constraint
Defines a check constraint.
constraint-name
Specifies a name for the check constraint.
check-condition
See CHECK check-condition.
LIKE table-name
table-name
Specifies that the table columns are to have exactly the same name and description as the columns of a table, view, or nickname that exists in the catalog, or a declared temporary table. The table name can refer only to another Datalake table (SQLSTATE 42858).
COMMENT string-constant
Specifies a comment that provides additional information about the table and which is stored in the internal Hive metastore. The maximum length of string-constant is 254 bytes.
PARTITIONED BY partitioning-element-list
Specifies how the data is to be partitioned. Including the PARTITIONED BY clause creates a file in the file path that contains data for each unique value in the column that is being partitioned. The partitioned value must be a sequence of Unicode characters with UTF-8 encoding, up to 1,024 bytes long, and is restricted to characters that can be included in the file path for object storage. Partition names are case sensitive. At query time, Db2 tries to scan only the file paths, or partitions, that contain data matching the query's search criteria.
column-definition

Specifies a new column definition that includes the column name, the column data type, and any column options. This results in a new column being defined for the table, and the column is also part of the partitioning-element-list. The following data types cannot be used as partitioning columns:

  • BINARY
  • BOOLEAN
  • DOUBLE
  • FLOAT
  • REAL
  • VARBINARY
Important: Although partitioning by TIMESTAMP is supported, it is not recommended. Partitioning on a TIMESTAMP value results in many small files, which can cause performance issues on reads and writes to the table.
Important: A partitioning key value cannot include any trailing blank spaces. This could be a problem if you plan to partition on a CHAR data type, because these values are always padded to their full length with blank spaces. Db2 also does not support partitioning on a zero-length string.
column-name

Specifies the name of a column that is defined in the element-list of the CREATE statement and that is to be used as part of the partitioning-element-list.

partition-expression

Specifies one or more partitioning expressions by which data that is inserted into the table is partitioned. The action of including the PARTITIONED BY clause creates a file in the file path that contains data for each unique value in the partitioning expression. At query time, Db2 tries to scan only the file paths, or partitions, that contain data that match the search criteria of the query. For Hive Datalake tables, this creates a generated column in the table for the expression. For more information, see Generated columns.

For Iceberg tables, this creates an implicitly hidden generated column in the table for the expression. For more information see, Generated columns and Hidden columns.

For Hive Datalake tables, the expression can reference one or more columns in the columns list. The AS clause specifies a name for the column that is derived from the expression and is required. The data type of this identifier matches the result type of the expression. If a query references the column that is derived from the expression, a new predicate is automatically injected into the query so that Db2 scans only the qualifying partitions.

For Iceberg tables, the AS clause is not required and is not supported. In addition, the only valid partition expressions that are supported by Iceberg Datalake tables are the following:
identity
Directly references an existing column in the table. For example, if a table has a column named C1, you can include a partition clause as PARTITION BY C1.
truncate(W)
Truncates the column value to the specified width (W) to create the partition value.
year
Extracts a date or timestamp year from the column value to create the partition value.
month
Extracts a date or timestamp month from the column value to create the partition value.
day
Extracts a date or timestamp day from the column value to create the partition value.
hour
Extracts a timestamp hour from the column value to create the partition value.
The partitioning expression cannot contain any of the following items (SQLSTATE 42621):
  • Subqueries
  • Column functions
  • Dereference operations or DEREF functions
  • User-defined or built-in functions that are non-deterministic
  • User-defined functions that use the EXTERNAL ACTION option
  • User-defined functions that are not defined with NO SQL
  • Host variables or parameter markers
  • Special registers or built-in functions that depend on the value of a special register
  • Global variables
  • An expression on a partitioning expression column
ROW FORMAT
Specifies the format of data rows.
delimited-row-format
Specifies a delimiter at the table level.
DELIMITED
Valid values for the FIELDS TERMINATED BY and ESCAPED BY parameters are the following items.
Important: It is recommended to use different values for each of these parameters; otherwise, the behavior is undefined.
  • A literal single-byte character such as vertical bar (|) or comma (,).
  • A value of the format \nnn, where nnn is the octal value of a character. For example, \001 is the ASCII character that represents CTRL-A.
FIELDS TERMINATED BY terminator-char
Specifies a delimiter for structured fields. You can include any single character, but the default is \001, the ASCII character that represents CTRL-A. Characters that match the field terminator character can be part of your data if you specify the ESCAPED BY clause and those characters are properly escaped.
LINES TERMINATED BY terminator-char
Specifies a delimiter for data rows. The only valid delimiter is a newline character (\n). Newline characters must not exist within your data because they cannot be escaped.
NULL DEFINED AS null-value
Specifies a value to represent the null value. The default is a literal backslash character followed by an uppercase N (\N).
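For example, a delimited text-file table that uses the clauses above might be defined as follows (the table name, columns, and location are hypothetical):

```sql
-- Hypothetical names and location. Comma-delimited fields with backslash
-- escapes, and \N representing nulls; rows are terminated by newlines.
CREATE DATALAKE TABLE csv_events (
    event_id INT,
    payload  VARCHAR(200)
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',' ESCAPED BY '\\'
    NULL DEFINED AS '\N'
STORED AS TEXTFILE
LOCATION 'DB2REMOTE://odfdefault//defaultbucket/csv_events';
```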
serde-row-format
Specifies the format of data rows as a SerDe class.
STORED AS file-format

Specifies the format of the underlying data files. For more information about the supported file formats, see Supported file formats.

The file format can also be specified as Java input and output classes, and this is shown in the output of the db2look command.

INPUTFORMAT class-name
Specifies the name of the Java class that implements the input format interface.
OUTPUTFORMAT class-name
Specifies the name of the Java class which implements the output format interface. For example, for a Datalake table STORED AS PARQUET, the following example output is returned after running the db2look command:

ROW FORMAT SERDE 
    'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS 
    INPUTFORMAT 
        'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
        'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
STORED BY ICEBERG
Specifies that the Datalake table is an Iceberg Datalake table.
LOCATION string-constant

Specifies the location of the table data. The location must be a DB2REMOTE identifier containing an access storage alias that establishes connectivity to the remote storage where the table data is to reside. For more information, see Remote storage connectivity for Datalake tables. Access to the file path that is specified in string-constant can be subject to additional file permissions depending on the storage service being used.

Note: The LOCATION clause must always specify a directory that contains the files for the table that is being created. The files must have the same structure. If you have files with different structures and you want to create a different table for each structure, the files must be in separate directories.
TBLPROPERTIES ('string-constant1' = 'string-constant2')
Defines job properties that can configure input or output formats, and that you can use to specify more detailed information about the table. The properties are passed as is to the underlying InputFormat and OutputFormat classes. Properties that are not implemented by these classes are ignored. Db2 does not validate the properties.
as-result-table
Specifies that the table is to be based on a SELECT statement.
AS (fullselect)
Specifies that, for each column in the derived result table of the fullselect, a corresponding column is to be defined for the table. Each defined column adopts the following attributes from its corresponding column of the result table (if applicable to the data type):
  • Column name
  • Column description
  • Data type, length, precision, and scale
  • Nullability
The following attributes are not included (although the default value and identity attributes can be included by using copy-options):
  • Default value
  • Identity attributes
  • Hidden attribute
  • ROW CHANGE TIMESTAMP
  • Any other optional attributes of the tables or views that are referenced in the fullselect
The following restrictions apply:
  • Every select list element must have a unique name (SQLSTATE 42711). The AS clause can be used in the select clause to provide unique names.
  • The fullselect cannot refer to host variables or include parameter markers.
  • The data types of the result columns of the fullselect must be data types that are valid for columns of a table.
  • If row or column level access control (RCAC) is activated for any table that is specified in the fullselect, RCAC is not cascaded to the new table.
  • The fullselect cannot include a data-change-table-reference clause (SQLSTATE 428FL).
  • Any valid fullselect that does not reference a typed table or a typed view can be specified.
WITH NO DATA
Specifies that the target table is not to be populated with data.
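For example, WITH NO DATA can be used to create an empty table whose columns are derived from a query (the table names and location are hypothetical):

```sql
-- Hypothetical names. The columns are derived from the fullselect,
-- but no rows are copied into the new table.
CREATE DATALAKE TABLE t1_empty
LOCATION 'DB2REMOTE://odfdefault//defaultbucket/t1_empty'
AS (SELECT i, s FROM t1) WITH NO DATA;
```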

Usage notes

  • The maximum number of Datalake tables that can be created in Db2 is 1,048,576.
  • If the current schema, which might have been created with either the USE or SET SCHEMA command, does not exist in the Hive metastore, an attempt is made to automatically create the schema in Hive.
  • All referential integrity constraints and nullability indicators are advisory only and are not always enforced.
  • When you use data types such as STRING and BINARY, the SQL processor might assume that such cells always contain 32,672 bytes of data, which can impact performance. Instead, use data types that have a specific length. For example, to improve performance, use a VARCHAR(n) with a defined size instead of a STRING.

    Altering a STRING column to VARCHAR(n) changes the definition of the table both in Db2 and in the Hive Metastore (HMS). If data in the VARCHAR(n) column exceeds the defined size, the data is returned, but it is truncated.

  • If no precision is included on the TIMESTAMP definition, the default is TIMESTAMP(6) for Iceberg tables and TIMESTAMP(9) for Hive Datalake tables.
  • If you create a table in a schema that does not exist, the schema is automatically created with the default name (your user ID).
  • To prevent the risk of data loss, creation of Iceberg tables in a location where Iceberg metadata files already exist is blocked. To create a table in this location, delete the data and metadata from the storage first.
  • CMX compression is supported in Db2. To enable CMX compression, set the following compression properties before running the CREATE DATALAKE TABLE statement:
    SET HADOOP PROPERTY 'mapred.output.compress'='true';
    SET HADOOP PROPERTY 'mapred.output.compression.type'='BLOCK';
    SET HADOOP PROPERTY 'mapred.output.compression.codec'='org.apache.hadoop.io.compress.CmxCodec';
  • For information about the data type mappings that are supported in the CREATE DATALAKE TABLE syntax, see Data types supported by Datalake tables.
  • Datalake tables and local Db2 tables all use the same namespace. If you synchronize a table from IBM watsonx.data to Db2, ensure that the table does not conflict with existing Db2 tables.

Restrictions

For information about restrictions, see Restrictions and limitations on Datalake tables.

Examples

To create a Datalake table (t1) with a Parquet file format, run the following statement:
CREATE DATALAKE TABLE t1 (i INT, s VARCHAR(10)) STORED AS PARQUET LOCATION 'DB2REMOTE://odfdefault//defaultbucket/t1' TBLPROPERTIES ('external.table.purge'='true');
To create a Datalake table (t2) by using the structure and contents of another Datalake table, run the following statement:
CREATE DATALAKE TABLE t2 LOCATION 'DB2REMOTE://odfdefault//defaultbucket/t2' AS (SELECT * FROM t1);
To create a Datalake table (orders) with a partitioning column, run the following statement:
CREATE DATALAKE TABLE orders (orderkey INT NOT NULL, o_comment VARCHAR(20) COMMENT 'comment column' NOT NULL ) PARTITIONED BY (orderdate VARCHAR(10)) LOCATION 'DB2REMOTE://odfdefault//defaultbucket/orders';
To create an Iceberg table (t1_ice) with the default Parquet file format, run the following statement:
CREATE DATALAKE TABLE t1_ice (id INT, age INT, name VARCHAR(30), birth_date DATE) STORED BY ICEBERG LOCATION 'db2remote://default//t1_ice';
To create an Iceberg table (t1_ice) with Parquet file format, run the following statement:
CREATE DATALAKE TABLE t1_ice (id INT, age INT, name VARCHAR(30), birth_date DATE) STORED BY ICEBERG STORED AS PARQUET LOCATION 'db2remote://default//t1_ice';
To create an Iceberg table (t1_ice) with ORC file format, run the following statement:
CREATE DATALAKE TABLE t1_ice (id INT, age INT, name VARCHAR(30), birth_date DATE) STORED BY ICEBERG STORED AS ORC LOCATION 'db2remote://default//t1_ice';
To create an Iceberg table (t1_ice) that specifies the external.table.purge property, run the following statement:
CREATE DATALAKE TABLE t1_ice (id INT, age INT, name VARCHAR(30), birth_date DATE) STORED BY ICEBERG LOCATION 'db2remote://default//t1_ice' TBLPROPERTIES ('external.table.purge'='true');
To create an Iceberg table (t1_ice) partitioned by the age column, run the following statement:
CREATE DATALAKE TABLE t1_ice (id INT, age INT, name VARCHAR(30), birth_date DATE) PARTITIONED BY (age) STORED BY ICEBERG LOCATION 'db2remote://default//t1_ice';
To create an Iceberg table (t1_ice) partitioned by the year expression applied to the birth_date column, run the following statement:
CREATE DATALAKE TABLE t1_ice (id INT, age INT, name VARCHAR(30), birth_date DATE) PARTITIONED BY (year(birth_date)) STORED BY ICEBERG LOCATION 'db2remote://default//t1_ice';
To create an Iceberg table (t1_ice) partitioned by the month expression applied to the birth_date column, run the following statement:
CREATE DATALAKE TABLE t1_ice (id INT, age INT, name VARCHAR(30), birth_date DATE) PARTITIONED BY (month(birth_date)) STORED BY ICEBERG LOCATION 'db2remote://default//t1_ice';
To create an Iceberg table (t1_ice) partitioned by the truncate expression applied to the name column, run the following statement:
CREATE DATALAKE TABLE t1_ice (id INT, age INT, name VARCHAR(30), birth_date DATE) PARTITIONED BY (truncate(10, name)) STORED BY ICEBERG LOCATION 'db2remote://default//t1_ice';