CREATE TABLE (HADOOP) statement

The CREATE TABLE (HADOOP) statement defines a Db2® Big SQL table that is based on a Hive table for the Hadoop environment. The definition must include its name and the names and attributes of its columns. The definition can include other attributes of the table, such as its primary key or check constraints.

Invocation

This statement can be embedded in an application program or issued by using dynamic SQL statements. It is an executable statement that can be dynamically prepared only if DYNAMICRULES run behavior is in effect for the package (SQLSTATE 42509).

Authorization

The authorization ID of the statement must have DBADM authority, or CREATETAB authority in combination with the following additional authorization:
  • One of the following privileges or authorities:
    • USE privilege on the table space
    • SYSADM authority
    • SYSCTRL authority
  • Plus one of these privileges or authorities:
    • IMPLICIT_SCHEMA authority on the database, if the implicit or explicit schema name of the table does not exist
    • CREATEIN privilege on the schema, if the schema name of the table refers to an existing schema
    • SCHEMAADM authority on the schema, if the schema name of the table refers to an existing schema
To define a foreign key, the authorization ID of the statement must have one of the following privileges for the parent table:
  • REFERENCES privilege on the table
  • REFERENCES privilege on each column of the specified parent key
  • CONTROL privilege on the table
  • SCHEMAADM authority on the schema, if the schema name of the parent table refers to an existing schema
  • DBADM authority
To use the LOCATION clause, the authorization ID of the statement must have one of the following privileges:
  • DATAACCESS authority
  • The following privileges on the files and the directory in the HDFS:
    • READ and WRITE privileges on the directory and all files in the directory, including files in any subdirectories that might be recursively specified in the LOCATION clause
    • EXECUTE privileges on all the directories and subdirectories in the specified LOCATION clause
If impersonation is not enabled, the bigsql user needs the same permissions that the Hive user needs for this operation. For example, with impersonation enabled in Hive, the hive user must have READ, WRITE, and EXECUTE privileges on the directory that is specified in the LOCATION clause, and READ and EXECUTE privileges on the parent directory; without impersonation, the bigsql user needs these same privileges.

Syntax

In the syntax fragments that follow, brackets [ ] enclose optional items, braces { } group alternatives that are separated by a vertical bar (|), an ellipsis (...) indicates that the preceding item can be repeated, and a number in parentheses refers to the corresponding item in the Notes list.

CREATE [EXTERNAL] HADOOP(1) TABLE [IF NOT EXISTS] table-name
    { element-list | LIKE table-name(2) | LIKE url(3) }
    (4)
    [ COMMENT string-constant ]
    [ PARTITIONED BY partitioning-element-list ]
    [ ROW FORMAT { delimited-row-format | serde-row-format } ]
    [ STORED AS file-format | STORED BY storage-handler ]
    [ LOCATION string-constant ]
    [ TBLPROPERTIES ( string-constant1 = string-constant2, ... ) ]
    [ SORT BY ( column-name ) ]
    [ as-result-table ]
element-list
( { column-definition | column-name(5) | unique-constraint | referential-constraint | check-constraint }, ... )
column-definition
column-name data-type [ column-options ]
data-type
primitive-type | array-type | row-type
primitive-type
(6)
TINYINT
SMALLINT | INT2
INTEGER | INT | INT4
BIGINT | INT8
FLOAT | FLOAT8
REAL | FLOAT4
DOUBLE [ PRECISION ] | FLOAT8
{ CHARACTER | CHAR }(7) [ ( integer ) ]
STRING(8)
{ VARCHAR | CHARACTER VARYING | CHAR VARYING } ( integer )
{ TIMESTAMP | DATETIME } [ ( integer ) ]
DATE
{ DECIMAL | DEC | NUMERIC | NUM } [ ( precision-integer [ , scale-integer ] ) ]    (the default precision and scale are (10,0); if only the precision is specified, the scale defaults to 0)
BINARY [ ( integer ) ]
VARBINARY [ ( integer ) ]
BOOLEAN
array-type
{ primitive-type | row-type } ARRAY [ integer-constant ]
{ primitive-type | row-type } ARRAY [ INT | INTEGER | VARCHAR(int) ]
row-type
ROW ( field-name data-type, ... )
column-options
[ COMMENT column-comment ]
[ NULL | NOT NULL ]
[ [ CONSTRAINT constraint-name ] constraint-choices [ constraint-attributes ] ]
constraint-choices
PRIMARY KEY | UNIQUE | references-clause | CHECK ( check-condition )
references-clause
REFERENCES table-name [ ( column-name, ... ) ]
check-condition
search-condition | functional-dependency
functional-dependency
{ column-name | ( column-name, ... ) } DETERMINED BY { column-name | ( column-name, ... ) }
constraint-attributes
[ NOT ENFORCED [ TRUSTED | NOT TRUSTED ] | ENFORCED ] [ ENABLE QUERY OPTIMIZATION | DISABLE QUERY OPTIMIZATION ]
unique-constraint
[ CONSTRAINT constraint-name ] { UNIQUE | PRIMARY KEY } ( column-name, ... ) [ constraint-attributes ]
referential-constraint
[ CONSTRAINT constraint-name ] FOREIGN KEY ( column-name, ... ) references-clause [ constraint-attributes ]
check-constraint
[ CONSTRAINT constraint-name ] CHECK ( check-condition )
partitioning-element-list
( { column-definition | column-name | partition-expression }, ... )
partition-expression
expression AS identifier
delimited-row-format
DELIMITED
    [ FIELDS TERMINATED BY terminator-char [ ESCAPED BY esc-char ] ]
    [ COLLECTION ITEMS TERMINATED BY terminator-char ]
    [ LINES TERMINATED BY terminator-char ]
    [ MAP KEYS TERMINATED BY terminator-char ]
    [ NULL DEFINED AS null-value ]
serde-row-format
SERDE serde-class [ WITH SERDEPROPERTIES ( string-constant1 = string-constant2, ... ) ]
file-format
ORC | PARQUET | PARQUETFILE | TEXTFILE | AVRO | SEQUENCEFILE | BINARY SEQUENCEFILE | TEXT SEQUENCEFILE | RCFILE | custom-file-format
custom-file-format
INPUTFORMAT class-name OUTPUTFORMAT class-name
storage-handler
class-name [ WITH SERDEPROPERTIES ( string-constant = string-constant, ... ) ]
as-result-table
AS ( fullselect ) [ ctas-options ]
ctas-options
WITH NO DATA | data-options refresh-options maintain-options optimization-options
data-options
DATA INITIALLY DEFERRED
refresh-options
REFRESH DEFERRED
maintain-options
MAINTAINED BY USER
optimization-options
DISABLE QUERY OPTIMIZATION | ENABLE QUERY OPTIMIZATION
Notes:
  • 1 The HADOOP keyword is required unless you enable the COMPATIBILITY_MODE global variable and use specific Hadoop table clauses. For more information about this requirement, see the Hadoop keyword usage note.
  • 2 For compatibility with Hive, when you use the LIKE clause with a table name, you can specify only the LOCATION, TBLPROPERTIES, or HINTS clause. This clause is not supported for Hadoop MQTs.
  • 3 This feature requires the Db2 Big SQL 7.2 service on IBM Cloud Pak® for Data 4.0. For compatibility with Hive, when you use the LIKE clause with a URL, you can specify only the STORED AS, PARTITIONED BY, LOCATION, TBLPROPERTIES, or HINTS clause. This clause is not supported for Hadoop MQTs.
  • 4 Each clause can be used only once.
  • 5 Specify values for column-name only when you are creating a Hadoop MQT or, optionally, when you are using the AS keyword; do not specify them when you are creating a regular Hadoop table. With the AS keyword, the column names can be inferred from the query, so the list is optional. Similarly, you can omit column-name when you create a table by using a SerDe that dynamically determines the column list from an external data source, such as an Avro schema.
  • 6 Db2 Big SQL supports the data types that are described in Data types. However, some data types are syntactically supported for compatibility with Hive but are treated as different data types within the Big SQL runtime environment. For more information, see the mapping of data types that are used in the CREATE TABLE statement.
  • 7 If you enable the COMPATIBILITY_MODE global variable, CHAR acts as a synonym for VARCHAR when you create the table, and the system catalog displays the VARCHAR data type.
  • 8 Because the STRING data type has no specific size associated with it, the SQL processor might assume that each value in a STRING column always contains 32 KB of data, and performance might be impacted. If performance is a concern, use types that contain a specific length. You can also set the bigsql.string.size property to default the STRING mapping to a smaller VARCHAR size. Set this property in one of two ways: run the SET HADOOP PROPERTY command, or set the value globally by updating the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file.
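As a sketch of both ways of setting this property (the table and column names are hypothetical):

```sql
-- Session-level: map STRING to VARCHAR(255) for tables created in this session
SET HADOOP PROPERTY 'bigsql.string.size' = '255';

-- The STRING column in this table is then treated as VARCHAR(255)
CREATE HADOOP TABLE reviews (
  id          INT,
  review_text STRING
);
```

For a cluster-wide default, the same property can instead be added to the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file.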

Description

EXTERNAL
Indicates that the data in the table is not managed by the database manager. If you drop the table, the table definition is removed from both the database manager and Hive catalogs, but the data remains unaffected. This keyword is typically used in conjunction with the LOCATION keyword to define a table over existing data. If a location is not specified for an external table, whether the table is explicitly defined with the EXTERNAL keyword or is implicitly external (see the usage note on the implicit application of EXTERNAL), the data is stored in the default warehouse location, even if a nonstandard location is specified for the schema in which the table is created. Hive applies this data placement.
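As a minimal sketch of an external table over pre-existing files (the table, column, and path names are hypothetical):

```sql
-- Dropping this table removes only the definition; the files under
-- the LOCATION path are left untouched
CREATE EXTERNAL HADOOP TABLE weblogs (
  log_ts TIMESTAMP,
  url    VARCHAR(500),
  status INT
)
STORED AS PARQUET
LOCATION '/data/raw/weblogs';
```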
HADOOP
This keyword is required when you are defining a Hadoop table unless you enable the COMPATIBILITY_MODE global variable. You can verify that you have defined a HADOOP table by querying the PROPERTY column in the SYSCAT.TABLES catalog view. For example:
SELECT SUBSTRING(PROPERTY,22,1)
  FROM SYSCAT.TABLES
  WHERE TABNAME='<table-name>';
IF NOT EXISTS
Checks whether a table with the specified name already exists. If it does, the table is not created, and no error is returned.
table-name
Specifies a unique name for the new table. The table name must be a valid identifier.

You can use the RENAME statement to rename a Hadoop table. When you rename a Hadoop table, the name in both the local database catalog and the Hive catalog is updated. When you rename a managed Hadoop table, which is a table that is created without the EXTERNAL keyword, the distributed file system storage directory is also updated.
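For example, a rename might look like the following statement (the schema and table names are hypothetical):

```sql
-- Updates the name in the local database catalog and the Hive catalog;
-- for a managed table, the DFS storage directory is also renamed
RENAME TABLE myschema.sales TO sales_archive;
```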

element-list
Defines the elements of a table, including the definition of columns and constraints on the table.
column-definition
Defines the attributes of the columns for a new Hadoop table that is not derived from another table by using the AS clause.
column-name
Names a column of the table that is not created by using the AS clause. The name cannot be qualified, and the same name cannot be used for more than one column of the table. The column name must be a valid identifier.
data-type
Specifies the data type of the column.
primitive-type
There are several built-in data types:
TINYINT
A 1-byte integer with a range of -128 to 127.

Db2 Big SQL automatically converts the TINYINT to a larger integer type, such as SMALLINT, INT, or BIGINT, or a floating-point type such as FLOAT or DOUBLE.

SMALLINT | INT2
A 2-byte integer with a precision of 5 digits and a range of -32,768 to 32,767.
INTEGER | INT | INT4
A 4-byte integer with a precision of 10 digits and a range of -2,147,483,648 to +2,147,483,647.
BIGINT | INT8
An 8-byte integer with a precision of 19 digits and a range of -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807.
FLOAT(integer) | FLOAT8 | REAL | FLOAT4 | DOUBLE | DOUBLE PRECISION
A single-precision or double-precision floating-point number.

A single-precision floating-point number is a 32-bit approximation of a real number. The number can be zero or can range from -3.4028234663852886e+38 to -1.1754943508222875e-38, or from 1.1754943508222875e-38 to 3.4028234663852886e+38.

A double-precision floating-point number is a 64-bit approximation of a real number. The number can be zero or can range from -1.7976931348623158e+308 to -2.2250738585072014e-308, or from 2.2250738585072014e-308 to 1.7976931348623158e+308.

A REAL number is a single-precision floating-point number. A DOUBLE number is a double-precision floating-point number.

CHARACTER(integer) | CHAR(integer)
A character string is a sequence of characters. The length of the string (integer) is the number of characters in the sequence. The default is 1.
STRING | VARCHAR(integer) | CHARACTER VARYING(integer) | CHAR VARYING(integer)
The STRING data type has no specific size associated with it. The default is VARCHAR(32672).
TIMESTAMP | DATETIME
A six- or seven-part value (year, month, day, hour, minute, second, and optional fractional seconds) that designates a date and time.

If specified, integer must be between 0 and 9 and represents the precision of fractional seconds from 0 (seconds) to 9 (nanoseconds).

DATE
A three-part value (year, month, and day) that designates a date. The range of the year part is 0001 to 9999. The range of the month part is 1 to 12. The range of the day part is 1 to x, where x depends on the month.
DECIMAL | DEC | NUMERIC | NUM (precision-integer, scale-integer)
A decimal number with defined precision and scale. The precision is the total number of digits and can range from 1 to 31. The scale is the number of digits to the right of the decimal point and can range from 0 to the precision of the number. The default precision is 10 and scale is 0.
BINARY | VARBINARY (integer)
A varying length binary value is specified as VARBINARY(length) or BINARY(length) and can be up to 32,672 bytes long. VARBINARY is an alias for BINARY, and they are functionally identical in every way. If no length is provided, the length is assumed to be 32,672.

However, it is recommended that an explicit length be specified. The Db2 Big SQL engine works most efficiently when the table definition enables a row to fit within 32 KB of memory. When the calculated row size exceeds 32 KB, some queries might see a performance degradation.

With this data type, the Java™ I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Db2 Big SQL.

BOOLEAN
A Boolean value.
array-type | row-type
For detailed information about these types, see Array values and Row values.

Each ARRAY or ROW type can be used in one of two ways: the SQL standard version or the Hive version. The SQL standard version, which is shown in the syntax diagram, is the recommended usage. When you specify the number of elements in the array as part of the table definition, the database manager can compute the maximum size of the column, which enhances query optimization.

With this data type, the Java I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Db2 Big SQL.

ARRAY
An ARRAY is a list that can be indexed. The elements in the array must be of the same type. Assuming that column C1 is declared as an ARRAY with two or more elements, you can access these elements by using the following notation: ...WHERE C1[2]=1.... An ARRAY of ROW types is valid.
integer-constant
For simple arrays, an integer constant that indicates the cardinality of the array, which is the number of elements in the array.
INT | INTEGER | VARCHAR (int)
The data type of the associative array index.
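A sketch that combines both array forms (the table and column names are hypothetical):

```sql
-- Ordinary array with a declared element count, and an associative
-- array indexed by a VARCHAR key; ORC is the recommended file format
CREATE HADOOP TABLE contacts (
  name   VARCHAR(100),
  phones VARCHAR(20) ARRAY[3],
  emails VARCHAR(100) ARRAY[VARCHAR(10)]
)
STORED AS ORC;

-- Elements are accessed by index in queries
SELECT name FROM contacts WHERE phones[1] = '555-0100';
```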
ROW
A ROW object contains a value for each attribute of the SQL structured type that it represents (java.sql.Struct). The ROW type is equivalent to the STRUCT type in Hive tables. The ROW type contains field definitions that contain field names and their data types. An entry is created in the SYSIBM.SYSATTRIBUTES table for each field in a ROW definition.
field-name
Specifies the name of a field for the ROW type. The name cannot be the same as another field of this ROW type (SQLSTATE 42711).
data-type
Specifies the data type of the field. The data type must be a primitive data type.
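A minimal sketch of a ROW column (the table, column, and field names are hypothetical):

```sql
-- The ROW type corresponds to the Hive STRUCT type; each field
-- gets an entry in SYSIBM.SYSATTRIBUTES
CREATE HADOOP TABLE customers (
  id      INT,
  address ROW (street VARCHAR(100), city VARCHAR(50), zip VARCHAR(10))
)
STORED AS ORC;
```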
column-options
You can define column options including a comment, the nullability attribute, or constraints.
COMMENT column-comment
A comment that provides additional information about the column.
NULL | NOT NULL
NOT NULL specifies that the column will not contain null values. Because the Hadoop table is external to any local database server, Db2 Big SQL enforces nullability at query execution time (SQL5104N). The nullability attribute of a column is enforced during an INSERT or LOAD USING operation. Do not assign a null value to a column that is defined as NOT NULL (SQLSTATE 23502).
CONSTRAINT constraint-name
Specifies the name of a constraint. You can define the following constraints on a column:
PRIMARY KEY
A primary key must be unique and must have the NOT NULL attribute. A table cannot have more than one primary key.
UNIQUE
A unique key prevents duplicate values in one or more NOT NULL columns of a table. A table can have multiple unique keys.
REFERENCES table-name
Ensures that required table relationships are maintained and that data entry rules are followed. The column on which this parameter is specified becomes the foreign key to the referenced parent table table-name.
CHECK check-condition
Sets restrictions on data that is added to a specific table. This parameter is enforced during INSERT operations, but is not enforced during a LOAD HADOOP operation, or on data that is made available outside of the database manager (by using Hive operations, for example).
search-condition
The search-condition must be true or unknown for every row of the table, and has the following restrictions:
  • A column reference must pertain to a column of the table that is being created.
  • The search condition cannot contain a TYPE predicate.
  • The search condition cannot contain any of the following items (SQLSTATE 42621):
    • Subqueries
    • CAST specifications with a SCOPE clause
    • Column functions
    • Functions that are not deterministic
    • Functions that are defined to have an external action
    • User-defined functions that are defined with either MODIFIES SQL or READS SQL DATA
    • OLAP specifications
    • Global variables
    • References to generated columns other than the identity column
    • An error tolerant nested table expression

Check constraints with search-condition are enforced when rows in the table are inserted or updated. A check constraint that is defined on a table automatically applies to all subtables of that table.
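A sketch of column-level check constraints (the table, column, and constraint names are hypothetical):

```sql
-- The check conditions are enforced during INSERT operations, but not
-- during LOAD HADOOP or for data written outside the database manager
CREATE HADOOP TABLE orders (
  order_id INT NOT NULL,
  quantity INT,
  status   VARCHAR(10),
  CONSTRAINT chk_qty    CHECK (quantity > 0),
  CONSTRAINT chk_status CHECK (status IN ('NEW', 'SHIPPED', 'CLOSED'))
);
```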

functional-dependency
Defines a functional dependency between columns or sets of columns.

The parent set of columns precedes the DETERMINED BY clause, and the child set of columns follows the DETERMINED BY clause. All restrictions on search-condition apply to both parent set and child set columns, and only simple column references are allowed in the column sets (SQLSTATE 42621). The same column must not be identified more than once in a functional dependency (SQLSTATE 42709). No column in the child set of columns can be a nullable column (SQLSTATE 42621).

A functional dependency is not enforced by the database manager during normal operations such as insert, update, or delete. The functional dependency might be used during query rewrite for optimization. Incorrect results might be returned if the integrity of a functional dependency is not maintained.
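A sketch of a functional dependency (the table, column, and constraint names are hypothetical; NOT ENFORCED and ENABLE QUERY OPTIMIZATION are shown explicitly for clarity, although NOT ENFORCED is the default for Hadoop tables):

```sql
-- city is determined by zip: zip is the child set and therefore must be NOT NULL;
-- the dependency is informational only and may be used during query rewrite
CREATE HADOOP TABLE addresses (
  zip  VARCHAR(10) NOT NULL,
  city VARCHAR(50),
  CONSTRAINT fd_city CHECK (city DETERMINED BY zip)
    NOT ENFORCED ENABLE QUERY OPTIMIZATION
);
```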

constraint-attributes
Defines attributes that are associated with primary key, unique, referential integrity, or check constraints.
NOT ENFORCED | ENFORCED
Specifies whether the constraint is enforced by the database manager during normal operations such as insert, update, or delete.
NOT ENFORCED
Specifies that the constraint is not enforced by the database manager. This is the default for Hadoop tables. A primary key or unique constraint cannot be NOT ENFORCED if there is a dependent ENFORCED referential constraint.
TRUSTED | NOT TRUSTED
Specifies whether the data can be trusted to conform to the constraint.
TRUSTED
Specifies that the data can be trusted to conform to the constraint. This is the default. Use this option only if the table data is independently known to conform to the constraint; otherwise, unpredictable results might occur.
NOT TRUSTED
Specifies that the data cannot be trusted to conform to the constraint. This option is valid only for referential integrity constraints (SQLSTATE 42613). Use this option when the data conforms to the constraint for most rows, but it is not independently known whether all current and future rows will conform. If a constraint is both not trusted and enabled for query optimization, it will not be used for any optimization that depends on the data being in complete conformity with the constraint.
ENFORCED
Although the ENFORCED keyword is available for compatibility with the syntax for local tables that are created with the CREATE TABLE statement, an error is returned when you specify this option for Hadoop tables (SQLSTATE 42858).
ENABLE QUERY OPTIMIZATION | DISABLE QUERY OPTIMIZATION
Specifies whether the constraint or functional dependency can be used for query optimization under appropriate circumstances.
ENABLE QUERY OPTIMIZATION
Specifies that the constraint is assumed to be true and can be used for query optimization. This is the default.
DISABLE QUERY OPTIMIZATION
Specifies that the constraint cannot be used for query optimization. This option cannot be specified for primary key and unique constraints (SQLSTATE 42613).
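A sketch that combines these attributes on an informational primary key (the table, column, and constraint names are hypothetical):

```sql
-- The key is not enforced (the default for Hadoop tables), but the data is
-- trusted to conform, so the optimizer can exploit the constraint
CREATE HADOOP TABLE products (
  product_id INT NOT NULL,
  name       VARCHAR(100),
  CONSTRAINT pk_products PRIMARY KEY (product_id)
    NOT ENFORCED ENABLE QUERY OPTIMIZATION
);
```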
column-name
Names a column of the source table when you create a new Hadoop (target) table by using the AS clause. If a column name is not specified, column names for the target table are derived from the result set of a query against the source table. You cannot explicitly specify the data types of these columns, because the types are derived from data types in the result set (example 20).
unique-constraint
Defines a unique key. This constraint does not allow duplicate values in one or more columns of the table. A table can have multiple unique keys. The columns that are specified in a unique constraint must be defined as NOT NULL.
referential-constraint
Defines a referential constraint.
CONSTRAINT constraint-name
Specifies a name for the referential constraint.
FOREIGN KEY (column-name, ...)
The foreign key of the referential constraint is composed of the identified columns. Each name in the list of column names must identify a column of the table that is being created, and the same column must not be identified more than once. The number of foreign key columns must be the same as the number of parent key columns, and the data types of the corresponding columns must be compatible (SQLSTATE 42830).
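A sketch of a referential constraint (the table, column, and constraint names are hypothetical, and an orders table with an order_id parent key is assumed to exist):

```sql
-- The foreign key column order_id references the parent key in orders;
-- the constraint is informational (NOT ENFORCED is the default)
CREATE HADOOP TABLE order_items (
  order_id INT NOT NULL,
  line_no  INT NOT NULL,
  CONSTRAINT fk_order FOREIGN KEY (order_id)
    REFERENCES orders (order_id) NOT ENFORCED
);
```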
check-constraint
Defines a check constraint.
constraint-name
Specifies a name for the check constraint.
check-condition
See CHECK check-condition.
LIKE table-name | url
table-name
Specifies that the table columns are to have exactly the same name and description as the columns of a table, view, or nickname that exists in the catalog, or a declared temporary table. The table name can refer only to another Hadoop table (SQLSTATE 42858). You cannot specify a typed table or view (SQLSTATE 428EC), and the Hadoop table that is referenced in table-name cannot contain an ARRAY or ROW column (SQLSTATE 428H2).
url
This feature requires the Db2 Big SQL 7.2 service on IBM Cloud Pak for Data 4.0.
Specifies that the table columns are to have a name and type that are most similar to columns in a file at the specified URL. If the given URL is a directory, a file is selected at random from within that directory.
  • The STORED AS clause specifies the file format. The format must be one of ORC, PARQUET, or PARQUETFILE.
  • If the selected file is stored in a directory structure that adheres to Hive's partition naming convention, the partitioning columns are inferred from the directory's components. All inferred partitioning columns have a default type of VARCHAR(1024). Otherwise, you can specify or override partitioning columns by using the PARTITIONED BY clause.
  • By default, the newly created table is stored separately from the given source URL. However, you can use the LOCATION clause to store the table at the same URL.
  • If the type of a column is not supported in Db2 Big SQL, it is assigned the BINARY type.
  • Columns of type STRING are mapped to VARCHAR(N) where N is the value specified in the bigsql.string.size property. If this property is not set, N is 1024.
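A sketch of LIKE with a URL (the table name and path are hypothetical; this form requires the Db2 Big SQL 7.2 service):

```sql
-- Column names and types are inferred from a file under the given URL;
-- only ORC, PARQUET, or PARQUETFILE can be named in STORED AS
CREATE HADOOP TABLE sales_copy
  LIKE 'hdfs://nameservice1/data/sales/2021'
  STORED AS PARQUET;
```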
COMMENT string-constant
Specifies a comment that provides additional information about the table. The maximum length of string-constant is 254 bytes.
PARTITIONED BY partitioning-element-list
Specifies how the data is to be partitioned. The action of including the PARTITIONED BY clause creates a directory in the distributed file system that contains data for each unique value in the column that is being partitioned. At query time, Db2 Big SQL tries to scan only those directories, or partitions, that contain data matching the query's search criteria.
column-definition
Specifies one or more columns by which data that is inserted into the table is partitioned. The definition includes the column name, the column data type, and any column options. This column definition does not apply to tables that are created by using the AS clause. The following data types cannot be used as partitioning columns:
  • ARRAY
  • BINARY
  • BOOLEAN
  • DOUBLE
  • FLOAT
  • MAP
  • REAL
  • ROW
  • STRUCT
  • TIMESTAMP
  • VARBINARY
Important: A partitioning key value cannot include any trailing blank spaces. This could be a problem if you plan to partition on a CHAR data type, because these values are always padded to their full length with blank spaces. Db2 Big SQL also does not support partitioning on a zero-length string.
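A sketch of partitioning by column definitions (the table and column names are hypothetical):

```sql
-- One DFS directory is created per unique (sale_year, region) combination
CREATE HADOOP TABLE sales (
  txn_id INT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (sale_year INT, region VARCHAR(20))
STORED AS PARQUET;

-- Predicates on the partitioning columns let the query scan
-- only the matching partitions
SELECT SUM(amount) FROM sales WHERE sale_year = 2021 AND region = 'EMEA';
```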
column-name
Names a column of the source table that identifies a partitioning column that is to be used when creating a new (target) table or MQT by using the AS clause. The data type of the partitioning column is derived from the result set of a query against the source table. If you specify a list of column names, the list must contain as many names as there are columns in the result set. Each column name must be unique and unqualified.
partition-expression
Specifies one or more partitioning expressions by which data that is inserted into the table is partitioned. The action of including the PARTITIONED BY clause creates a directory in the distributed file system (DFS) that contains data for each unique value in the partitioning expression. At query time, Db2 Big SQL tries to scan only those directories, or partitions, that contain data that match the search criteria of the query.
The expression can reference one or more columns in the columns list. The AS clause specifies a name for the column that is derived from the expression. The data type of this identifier matches the result type of the expression. If a query references the column that is derived from the expression, a new predicate is automatically injected into the query so that Db2 Big SQL scans only the qualifying partitions. The data type of the result of the expression cannot be one of the following types:
  • ARRAY
  • BINARY
  • BOOLEAN
  • DOUBLE
  • FLOAT
  • MAP
  • REAL
  • ROW
  • STRUCT
  • TIMESTAMP
  • VARBINARY
Important: A partitioning key value cannot include any trailing blank spaces. This could be a problem if you plan to partition on a CHAR data type, because these values are always padded to their full length with blank spaces. Db2 Big SQL also does not support partitioning on a zero-length string.
The partitioning expression cannot contain any of the following items (SQLSTATE 42621):
  • Subqueries
  • Column functions
  • Dereference operations or DEREF functions
  • User-defined or built-in functions that are non-deterministic
  • User-defined functions that use the EXTERNAL ACTION option
  • User-defined functions that are not defined with NO SQL
  • Host variables or parameter markers
  • Special registers or built-in functions that depend on the value of a special register
  • Global variables
  • An expression on a partitioning expression column
  • An expression that uses a JSON function
Important: The LOAD HADOOP statement does not automatically generate column values for a table that is partitioned by a partition-expression; when you use LOAD HADOOP on such a table, ensure that the column values are generated in advance and stored in a file. The INSERT statement, by contrast, does automatically generate these column values. Therefore, use the INSERT statement if values for the new column must be generated automatically, and use the LOAD HADOOP statement only if all column values are generated in advance.
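A sketch of a partitioning expression (the table and column names are hypothetical):

```sql
-- sale_month is derived from sale_date; queries that filter on sale_date
-- get an injected predicate so that only qualifying partitions are scanned
CREATE HADOOP TABLE events (
  event_id  INT,
  sale_date DATE
)
PARTITIONED BY (MONTH(sale_date) AS sale_month)
STORED AS PARQUET;
```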
ROW FORMAT
Specifies the format of data rows.
delimited-row-format
Specifies a delimiter at the table level.
DELIMITED
Valid values for the FIELDS TERMINATED BY and ESCAPED BY parameters are the following items.
Important: It is recommended to use different values for each of these parameters; otherwise, the behavior is undefined.
  • A literal single-byte character such as vertical bar (|) or comma (,).
  • A value of the format \nnn, where nnn is the octal value of a character. For example, \001 is the ASCII character that represents CTRL-A. When you specify a delimiter as an octal value (\nnn), nnn must be in the range of 0 to 177, which is the range for lower ASCII characters.
  • A value of the format \xYY, where YY is a two-digit hexadecimal value of a character. This representation is retained for compatibility with Db2 Big SQL v1; however, the literal character or octal representations are recommended.
  • You can also specify a delimiter as a decimal integer value in the range of -128 to 127. For example, the vertical bar (|) character is represented either by octal value \174 or decimal integer value 124.
  • To specify a higher (or extended) ASCII character, you can use a negative decimal value. Negative values are subtracted from 256 to get the extended ASCII code value. For example, you can specify a section sign character (§, ASCII code 245) as -11.
FIELDS TERMINATED BY terminator-char
Specifies a delimiter for structured fields. You can include any single character, but the default is \001, the ASCII character that represents CTRL-A. Characters that match the field terminator character can be part of your data if you specify the ESCAPED BY clause and those characters are properly escaped.
COLLECTION ITEMS TERMINATED BY terminator-char
Specifies a delimiter for the individual items in a collection. You can include any single character, but the default is \002, the ASCII character that represents CTRL-B.
LINES TERMINATED BY terminator-char
Specifies a delimiter for data rows. The only valid delimiter is a newline character (\n). Newline characters must not exist within your data because they cannot be escaped.
MAP KEYS TERMINATED BY terminator-char
Specifies a delimiter for map keys. A map is a simple set of key/value pairs in a list. It is a collection, so the collection items terminator enables you to split out each map entry. However, within each map entry, the map keys terminator separates the key from its value. For example, pf#500$epf#200 represents two map entries. The collection items terminator is $ and the map keys terminator is #.
NULL DEFINED AS null-value
Specifies a value to represent the null value. The default is a literal backslash character followed by an uppercase N (\N).
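A sketch that combines several of these delimiter clauses (the table and column names are hypothetical):

```sql
-- Comma-separated fields with backslash escaping; array items within a
-- field are separated by a vertical bar; \N represents the null value
CREATE HADOOP TABLE csv_data (
  id   INT,
  tags VARCHAR(100) ARRAY[5]
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
  COLLECTION ITEMS TERMINATED BY '|'
  NULL DEFINED AS '\N'
STORED AS TEXTFILE;
```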
serde-row-format
Specifies the format of data rows as a SerDe class.
SERDE serde-class
Specifies the name of a Java class in the Db2 Big SQL server CLASSPATH that implements the Hive SerDe interface org.apache.hadoop.hive.serde2.SerDe.
WITH SERDEPROPERTIES ('string-constant1' = 'string-constant2')
Associates SerDe properties with the SerDe class. For example, if you want to eventually load HBase data into a Hive table, create the table by using this option and the hbase.columns.mapping parameter.
To use a SerDe JAR file, complete the following steps to copy the JAR file to the appropriate libraries:
  1. Stop the Db2 Big SQL server.
  2. Copy the JAR file to $BIGSQL_HOME/userlib and $HIVE_HOME/lib.
  3. Restart the Db2 Big SQL server.
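As a sketch of the WITH SERDEPROPERTIES clause, and assuming that the Hive OpenCSVSerde class is available in the Db2 Big SQL server CLASSPATH, SerDe-specific options such as the separator and quote characters can be passed as follows:

```sql
-- Illustrative sketch: pass CSV parsing options to the Hive OpenCSVSerde.
CREATE HADOOP TABLE csv_tab (
    i INT,
    s VARCHAR(10)
  )
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  WITH SERDEPROPERTIES (
    'separatorChar' = ',',
    'quoteChar' = '"'
  )
  STORED AS TEXTFILE;
```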
STORED AS file-format
Specifies the format of the underlying data files. Db2 Big SQL can read most available file storage formats. For more information about the supported file formats, see File formats supported by Db2 Big SQL.
You can also specify the storage format as Java input and output format classes.
INPUTFORMAT class-name
Specifies the name of a Java class, such as hadoop.input.format.class.name, in the Db2 Big SQL server CLASSPATH that implements the Hadoop input format interface, such as org.apache.hadoop.mapred.InputFormat.
Important: In Hadoop, a record is defined by some input format in a data source. An input format refers to a piece of code that specifies how the files are to be read. When a Db2 Big SQL table is created, the server chooses an appropriate input format that is based on how the table is created. An input format can also be explicitly specified by naming a Java class that contains an implementation of the InputFormat interface of Hadoop. The Java class can turn a record into a row, which can determine how the data within a record corresponds to the columns in the definition of a table.
OUTPUTFORMAT class-name
Specifies the name of a Java class, such as hadoop.output.format.class.name, in the Db2 Big SQL server CLASSPATH that implements the Hadoop output format interface, such as org.apache.hadoop.mapred.OutputFormat.
STORED BY storage-handler
Specifies the name of a Java class in the Db2 Big SQL server CLASSPATH that implements the storage handler interface. For example, you can use org.apache.hadoop.hive.ql.metadata.HiveStorageHandler for the Hive interface. Use org.apache.hadoop.hive.hbase.HBaseStorageHandler for the HBase interface.
LOCATION string-constant
Specifies a particular distributed file system directory in which to store the data files. Access to the path that is specified in string-constant is controlled by the HDFS file permissions, but access can also be granted to other users.
Note: The LOCATION clause must always specify a directory that contains the files for the table that is being created. The files must have the same structure. If you have files with different structures and you want to create a different table for each structure, the files must be in separate directories.

Db2 Big SQL provides configuration parameters to enforce the requirement that the LOCATION clause be used with the EXTERNAL clause. These configuration parameters ensure that a DROP statement does not accidentally drop data that is outside of the hive/warehouse directory.

For more information, see Using S3 Object Storage with Db2 Big SQL tables.
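For example, a table that reads existing files from a specific HDFS directory might be declared as in the following sketch (the directory path is illustrative); combining EXTERNAL with LOCATION ensures that a later DROP statement does not delete the files:

```sql
-- Illustrative sketch: point the table at an existing HDFS directory.
CREATE EXTERNAL HADOOP TABLE sales_ext (
    id INT,
    amount DOUBLE
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/user/biadmin/sales_data';
```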

TBLPROPERTIES ('string-constant1' = 'string-constant2')
Defines job properties that can configure input or output formats, and that you can use to specify more detailed information about the table. The properties are passed as is to the underlying InputFormat and OutputFormat classes. Properties that are not implemented by these classes are ignored. Db2 Big SQL does not validate the properties.
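As a sketch, TBLPROPERTIES can pass format-specific options. For example, the Hive text-format property skip.header.line.count tells the underlying input format to skip a header row; whether a given property takes effect depends on the input and output format classes in use:

```sql
-- Illustrative sketch: skip one header line in the underlying text files.
CREATE HADOOP TABLE csv_with_header (
    i INT,
    s VARCHAR(10)
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  TBLPROPERTIES ('skip.header.line.count' = '1');
```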
SORT BY column-name
Specifies one or more columns by which data that is inserted into the table is sorted. Use this clause to improve query performance with some file formats (such as ORC).
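For example, an ORC table that keeps inserted rows sorted by a key column might look like the following sketch (the table and column names are illustrative):

```sql
-- Illustrative sketch: sort inserted data by orderkey, which can improve
-- query performance with the ORC format.
CREATE HADOOP TABLE orders_sorted (
    orderkey INT,
    amount DOUBLE
  )
  STORED AS ORC
  SORT BY (orderkey);
```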
as-result-table
Specifies that the table is to be based on a SELECT statement. The table name cannot refer to a Hadoop table that contains an ARRAY or ROW column (SQLSTATE 428H2).
AS (fullselect)
Specifies that, for each column in the derived result table of the fullselect, a corresponding column is to be defined for the table. Each defined column adopts the following attributes from its corresponding column of the result table (if applicable to the data type):
  • Column name
  • Column description
  • Data type, length, precision, and scale
  • Nullability
The following attributes are not included (although the default value and identity attributes can be included by using copy-options):
  • Default value
  • Identity attributes
  • Hidden attribute
  • ROW CHANGE TIMESTAMP
  • Any other optional attributes of the tables or views that are referenced in the fullselect
The following restrictions apply:
  • Every select list element must have a unique name (SQLSTATE 42711). The AS clause can be used in the select clause to provide unique names.
  • The fullselect cannot refer to host variables or include parameter markers.
  • The data types of the result columns of the fullselect must be data types that are valid for columns of a table.
  • If row or column level access control (RCAC) is activated for any table that is specified in the fullselect, RCAC is not cascaded to the new table.
  • The fullselect cannot include a data-change-table-reference clause (SQLSTATE 428FL).
  • Any valid fullselect that does not reference a typed table or a typed view can be specified.
WITH NO DATA
Specifies that the target table is not to be populated with data.
DATA INITIALLY DEFERRED
Specifies that data is not inserted into the table as part of the CREATE TABLE (HADOOP) statement.
REFRESH DEFERRED
Specifies that the table will be populated with data at a later time.
MAINTAINED BY USER
Specifies that the data in the materialized query table is maintained by the user. The user is allowed to perform insert or truncate operations against user-maintained materialized query tables. The REFRESH TABLE statement cannot be run against user-maintained materialized query tables.
ENABLE QUERY OPTIMIZATION
Specifies that the materialized query table can be used for query optimization.
DISABLE QUERY OPTIMIZATION
Specifies that the materialized query table is not used for query optimization. The table can still be queried directly. This is the default.

Usage notes

  • The maximum number of Hadoop tables that can be created in Db2 Big SQL is 1,048,576.
  • By default, the HADOOP keyword is required in the CREATE TABLE statement, and specifying it explicitly is recommended. However, if you enable the COMPATIBILITY_MODE global variable and the HADOOP keyword is not included in the CREATE TABLE statement, the type of table that is created is determined by the presence of one or more of the following Hadoop table clauses:
    • COMMENT
    • EXTERNAL
    • LOCATION
    • PARTITIONED BY
    • ROW FORMAT
    • STORED AS
    • STORED BY
    • STRING
    • TBLPROPERTIES
    • WITH HINTS
    If the HADOOP keyword is not specified, the COMPATIBILITY_MODE global variable is not enabled, and the statement is ambiguous, the source table determines the table type. For clarity, specify the HADOOP keyword.
  • One default consequence of running the CREATE TABLE (HADOOP) statement is that GRANT statements are applied for the connected user in Hive. If Hive impersonation is disabled, you must complete the following steps to change the hive.security.metastore.authorization.manager property in the hive-site.xml file:
    1. In the Cloudera Manager UI, click Clusters > Hive.
    2. Select the Configurations tab and search for hive-site.
    3. In Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml, enter a new property with name hive.security.metastore.authorization.manager and value org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider.
    4. Click Save Changes.
    5. Click Stale Configuration: Restart needed and then Restart Stale Services.
  • If the current schema, which might have been created with either the USE or SET SCHEMA command, does not exist in the Hive metastore, an attempt is made to automatically create the schema in Hive.
  • All referential integrity constraints and nullability indicators are advisory and are not enforced at data load time.
  • When you use data types such as STRING and BINARY, the SQL processor might assume that such cells always contain 32 KB of data, which can impact performance. Instead, use data types that have a specific length. To improve performance, use a VARCHAR(n) with a defined size instead of a STRING. Alter the table to define an appropriate length for the column by using the following syntax:
    ALTER TABLE <schema>.<table> ALTER COLUMN <col> SET DATA TYPE VARCHAR(<size>)
    This alter operation changes the definition of the table both in Db2 Big SQL and in Hive. If data in the VARCHAR(n) column exceeds <size>, a null value is returned when that data is queried.
    You can also set the bigsql.string.size property to default the STRING mapping to a smaller VARCHAR size. Set this property in one of two ways:
    • Run the SET HADOOP PROPERTY command.
    • Set the value globally by updating the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file.
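For example, to map STRING columns to a smaller VARCHAR for the current session, you might run the following sketch before creating the table (the size value is illustrative):

```sql
-- Illustrative sketch: make STRING default to VARCHAR(1024) for this session.
SET HADOOP PROPERTY 'bigsql.string.size'='1024';

CREATE HADOOP TABLE t_str (
    id INT,
    comments STRING
  );
```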
  • If no precision is included on the TIMESTAMP definition, the default is TIMESTAMP(9), which is compatible with the maximum precision of a TIMESTAMP value in Db2 Big SQL or Hive.
  • An ARRAY or ROW column cannot be compared directly with another ARRAY or ROW column (SQLSTATE 42818). The elements in an ARRAY or ROW can be compared only to another expression of a primitive type. For example, assume that A and B are ARRAY columns.
    • The following statement is not allowed:
      ...where a = b...
    • The following statement is valid if the element type is a primitive type:
      ...where a[1] = b[1]...
    Now assume that C and D are ROW columns.
    • The following statement is allowed:
      ...where c.field1 = d.field1...
  • If you create a table in a schema that does not exist, the schema is automatically created with the default name (your user ID).
  • The columns and data types for an Avro table are fixed at the time that you run the CREATE TABLE (HADOOP) statement. If you create the table with an EXTERNAL schema by using the avro.schema.url configuration property and then change the schema, the change is not immediately recognized. This change can result in query execution errors.
    To ensure that your changes are recognized, you must drop the local table from the Db2 Big SQL catalog, and then import the table from the Hive catalog. Use the following commands to drop and import the table:
    
    SET SYSHADOOP.CATALOG_SYNC_MODE=0;
    DROP TABLE avro_table;
    SET SYSHADOOP.CATALOG_SYNC_MODE=1;  
    CALL HCAT_SYNC_OBJECTS('schema1', 'avro_table');
  • CMX compression is supported in Db2 Big SQL. To enable CMX compression, set the following compression properties before running the CREATE TABLE (HADOOP) statement:
    
    SET HADOOP PROPERTY 'mapred.output.compress'='true';
    SET HADOOP PROPERTY 'mapred.output.compression.type'='BLOCK';
    SET HADOOP PROPERTY 'mapred.output.compression.codec'='org.apache.hadoop.io.compress.CmxCodec';
  • The following table shows the data type mappings between types that are supported in the CREATE TABLE (HADOOP) syntax, the corresponding types in the Hive catalog, and the native SQL types that Db2 Big SQL uses at run time:
    Table 1. Data type mappings
    CREATE TABLE (HADOOP) data type (1) | Hive data type (2)                                                | SQL data type (3)
    BIGINT                              | BIGINT                                                            | BIGINT
    BINARY                              | BINARY (starting with Hive 0.8.0)                                 | VARBINARY
    BOOLEAN                             | BOOLEAN                                                           | BOOLEAN
    CHAR                                | CHAR                                                              | CHAR
    DATE                                | DATE                                                              | DATE
    DECIMAL                             | DECIMAL (introduced in Hive 0.11.0 with a precision of 38 digits) | DECIMAL
    DOUBLE                              | DOUBLE                                                            | DOUBLE
    FLOAT                               | DOUBLE                                                            | FLOAT
    INT                                 | INT                                                               | INT
    REAL                                | FLOAT                                                             | REAL
    SMALLINT                            | SMALLINT                                                          | SMALLINT
    STRING                              | STRING                                                            | VARCHAR(32672)
    TIMESTAMP                           | TIMESTAMP                                                         | TIMESTAMP(9)
    TIMESTAMP(n)                        | TIMESTAMP                                                         | TIMESTAMP(n)
    TINYINT                             | TINYINT                                                           | SMALLINT
    VARBINARY                           | BINARY                                                            | VARBINARY
    VARCHAR                             | VARCHAR                                                           | VARCHAR
    1. This column lists the data types that you can specify in the CREATE TABLE (HADOOP) statement.
    2. This column lists the corresponding data types as defined in the Hive metastore. For example, a column that is defined as FLOAT in the statement is listed as a DOUBLE column in the Hive catalogs, and table values for that column must conform to Hive (or SerDe) rules for DOUBLE values.
    3. This column lists the corresponding native Db2 Big SQL data types. When such a type interacts with other columns through a SELECT or INSERT statement, for example, the data must be treated as the Big SQL type.
  • If Db2 Big SQL is reinstalled, existing Hadoop MQTs are resynchronized from Hive and are no longer defined as Hadoop MQTs. These tables must be dropped and recreated; otherwise, the following error is returned:
    SQL0159N The statement references an object that identifies 
    an unexpected object type. Object: "<TABLE NAME>". Object type: "NON MQT". 
    Expected object type: "MQT". SQLSTATE=42809
  • Hadoop tables, HBase tables, and local Db2 tables all use the same namespace. If you create tables in Hive that are synchronized to Db2 Big SQL, ensure that these tables do not conflict with local Db2 tables in Big SQL.
  • When a table is created in Db2 Big SQL by using the CREATE TABLE (HADOOP) statement, Db2 Big SQL in turn creates a table on the Hive metastore. Hive tables that are implicitly created by Db2 Big SQL, however, are not Hive managed tables. To create a table, Db2 Big SQL runs a CREATE EXTERNAL TABLE statement rather than a CREATE TABLE statement in Hive. But unlike what happens with Hive tables that are explicitly created as EXTERNAL, when a Big SQL table is dropped, Db2 Big SQL deletes the data that was associated with the table, as you would expect with a managed Hive table. You can use output from the db2look utility to determine whether a table is an external table or a Hive managed table. For example, consider the following statement:
    
    CREATE HADOOP TABLE foo(c1 INT) STORED AS ORC;
    Running db2look against this statement returns the following output, which in this example shows that the table is an external table. The table property 'external.table.purge'='true' indicates that the data that is associated with the table will be deleted when the table is dropped.
    
    CREATE EXTERNAL HADOOP TABLE "BIGSQL  "."FOO"(
        C1 INT
    )
    ROW FORMAT SERDE
        'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
    STORED AS
        INPUTFORMAT
            'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
        OUTPUTFORMAT
            'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
    LOCATION
        'hdfs:///warehouse/tablespace/external/hive/bigsql.db/foo'
    TBLPROPERTIES(
        'bucketing_version'='2',
        'external.table.purge'='true'
    );
  • The following syntax alternatives are supported for compatibility with other database products:
    • You can use the ARRAY syntax that Hive uses to define an ordinary ARRAY data type. For example, you can specify ARRAY<VARCHAR(8)> instead of VARCHAR(8) ARRAY[ ].
    • You can use the MAP syntax that Hive uses to define an associative ARRAY data type. For example, you can specify MAP<VARCHAR(10), DOUBLE> instead of DOUBLE ARRAY [VARCHAR(10)].
    • You can use the STRUCT syntax that Hive uses to define a ROW data type. For example, you can specify STRUCT<F1:INTEGER, F2:DOUBLE> instead of ROW(F1 INTEGER, F2 DOUBLE).
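The two notations are interchangeable. For example, the following sketch defines the same column types first in Hive-style syntax and then in the SQL-style equivalents (the table and column names are illustrative):

```sql
-- Hive-style complex type syntax
CREATE HADOOP TABLE t_hive_style (
    id INT,
    a ARRAY<VARCHAR(8)>,
    m MAP<VARCHAR(10), DOUBLE>,
    r STRUCT<F1:INTEGER, F2:DOUBLE>
  );

-- Equivalent SQL-style syntax
CREATE HADOOP TABLE t_sql_style (
    id INT,
    a VARCHAR(8) ARRAY[],
    m DOUBLE ARRAY[VARCHAR(10)],
    r ROW(F1 INTEGER, F2 DOUBLE)
  );
```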

Restrictions

  • A table with the same table-name must not already exist in the schema in the Hive catalog, regardless of case differences. For example, the second table definition below is not valid:
    
    CREATE HADOOP TABLE "Foo" (c1 INT); 
    CREATE HADOOP TABLE "FOO" (c1 INT);
  • The ENFORCED constraint attribute cannot be applied to Db2 Big SQL Hadoop tables (SQLSTATE 42858).
  • If you enabled the COMPATIBILITY_MODE global variable and you define a local database table by specifying the LIKE clause, you cannot include the LOCATION, TBLPROPERTIES, or HINTS clause.
  • The following restrictions apply to the ARRAY and ROW data types:
    • The first column of a table cannot be defined with the ARRAY or ROW data type.
    • Use of the ARRAY or ROW data type in a key definition or constraint is not supported (SQLSTATE 42962).
    • Columns that are defined with the ARRAY or ROW data type cannot be used as the anchor object of an anchored type (SQLSTATE 428HS).
    • Referencing a table that contains a column of type ARRAY or ROW in the AS or LIKE clause is not supported (SQLSTATE 428H2).
    • The maximum number of ARRAY or ROW definitions in a database is 32763 (SQLSTATE 54035). This limit also includes any user-defined types that use the CREATE TYPE statement.
    • Db2 Big SQL ARRAY and ROW types are not compatible with SQL PL ARRAY and ROW types.
    • Use of an ARRAY type as a type within an ARRAY type for Hadoop tables is not supported.
    • The result type of a column in a fullselect cannot be ARRAY.
  • The use of scrollable cursors for Hadoop tables is not supported.

Examples

  1. Create a simple Hadoop table.
    
    CREATE HADOOP TABLE t1 (
        c1 INT,
        c2 VARCHAR(20),
        c3 DOUBLE,
        c4 REAL
      );
  2. Create a table with an underlying delimited text file.
    
    CREATE HADOOP TABLE t (
        i INT,
        s VARCHAR(10)
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
    
  3. Create tables with underlying BINARY and TEXT sequence files.
    
    CREATE HADOOP TABLE seq_tab (
        c1 INT NOT NULL,
        c2 VARCHAR(20)
      )
      STORED AS BINARY SEQUENCEFILE;
    
    CREATE HADOOP TABLE t (
        i INT,
        s VARCHAR(10)
      )
      STORED AS TEXT SEQUENCEFILE;
  4. Use the LazySimpleSerDe class to create a table.
    
    CREATE HADOOP TABLE t (
        i INT,
        s VARCHAR(10)
      )
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      STORED AS SEQUENCEFILE;
    
  5. Create a table with underlying Parquet files.
    
    CREATE HADOOP TABLE t (
        i INT,
        s VARCHAR(10)
      )
      STORED AS PARQUETFILE;
  6. Create an Avro table with an inline Avro schema.
    
    CREATE EXTERNAL HADOOP TABLE bs_rev_profit_by_compaign (
        Revenue DOUBLE,
        GrossProfit DOUBLE,
        CompaignName VARCHAR(10)
      )
      COMMENT 'A table backed by Avro data with the Avro schema
      embedded in the CREATE TABLE statement'
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'  
      STORED AS
        INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      LOCATION '/user/biadmin/avro/sheet/'
      TBLPROPERTIES (
        'avro.schema.literal'='{"type":"record",
          "name": "TUPLE_3",
          "fields": 
          [ { "name": "Revenue","type": [ "null","double" ],
          "doc": "autogenerated from Pig Field Schema"},
          {"name":"GrossProfit","type": [ "null", "double" ],
          "doc": "autogenerated from Pig Field Schema"}, 
          {"name": "CompaignName","type": [ "null", "string" ],
          "doc":"autogenerated from Pig Field Schema"} 
          ]
        }
      ')
    ;
  7. Create a table with a partitioning column.
    
    CREATE HADOOP TABLE orders (
        orderkey INT NOT NULL,
        o_comment VARCHAR(20) COMMENT 'comment column' NOT NULL
        ) 
      PARTITIONED BY (orderdate VARCHAR(10));
  8. Create a table with a partition that is based on a partitioning expression on a column.
    
    CREATE HADOOP TABLE orders (
        orderkey INT NOT NULL,
        o_comment VARCHAR(20) COMMENT 'comment column' NOT NULL,
        orderdate DATE
        ) 
      PARTITIONED BY (MONTH(orderdate) AS order_month);
  9. Create a table that is like the following table:
    
    CREATE HADOOP TABLE employee (
        empno INT,
        fname VARCHAR(100),
        lname VARCHAR(100),
        dept VARCHAR(100)
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n';
    Convert the data to a binary sequence file, which can improve performance.
    
    CREATE HADOOP TABLE employeeBINSEQ (
        empno,
        fname,
        lname,
        dept
      )
      STORED AS BINARY SEQUENCEFILE
      AS (SELECT * FROM employee);
  10. Create a table by using the structure of another table, but without using any of the data from the source table.
    
    CREATE HADOOP TABLE t1 (
        c1,
        c2
      ) 
      AS (SELECT x1, x2 FROM t2) WITH NO DATA;
  11. Create a table with complex data types (STRUCT and ARRAY).
    
    CREATE HADOOP TABLE rf_del1 (
        id VARCHAR(10),
        c1 STRUCT<f1:varchar(10), f2:varchar(10)>,
        c2 ARRAY<varchar(10)>
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY '/'
      LINES TERMINATED BY '\n';
  12. Create a Hadoop table that can ingest HBase data.
    
    CREATE HADOOP TABLE staff_hbase (
        id SMALLINT,
        name VARCHAR(100),
        dept SMALLINT,
        job VARCHAR(10),
        years SMALLINT,
        salary VARCHAR(10),
        comm VARCHAR(10)
      ) 
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        'hbase.columns.mapping' = ':key,info:name,info:dept,role:job,role:years,money:salary,money:comm');
    For more information, see Loading HBase data into a Hive table.
  13. Create a simple table with an array of integers whose maximum cardinality is 10.
    
    CREATE HADOOP TABLE mytable (
      id INT, col1 INT ARRAY[10]);
    Select the value of the second element in the array.
    
    SELECT col1[2]
      FROM mytable;
    Return all values that are based on a particular value in COL1.
    
    SELECT id
      FROM mytable
      WHERE col1[8]=12;
  14. Create a simple table with a STRUCT column.
    
    CREATE HADOOP TABLE mytable (
      id INT, col1 STRUCT<c1:INT, c2:DOUBLE>
      );
    Select element C1 from COL1.
    
    SELECT col1.c1
      FROM mytable;
    Return all values for ID and COL.C2 that are based on a particular value in COL1.
    
    SELECT id, col1.c2
      FROM mytable
      WHERE col1.c1 = 5;
  15. Create a table with the ROW data type.
    
    CREATE HADOOP TABLE mytable (
      id INT,
      col1 ROW(c1 INT,c2 DOUBLE)
      );
    Select field C1 from COL1.
    
    SELECT col1.c1
      FROM mytable;
  16. Create a table with the MAP data type, an associative array of integer values that can be indexed by a VARCHAR(20) key.
    
    CREATE HADOOP TABLE mytable (
      id INT, col1 INT ARRAY[VARCHAR(20)]
      );
    Select an element value from the MAP where the index value is 'MY STRING'.
    
    SELECT col1['MY STRING']
      FROM mytable;
  17. Create a view from a Hadoop table named TESTVIEW01 that has a column of type ARRAY.
    
    CREATE HADOOP TABLE testview01 (
      id INT,
      arr TINYINT ARRAY[5]
      );
    
    CREATE VIEW view001(i,c1) AS
      SELECT id, arr
        FROM testview01
        WHERE arr[1] < 8;
    Select the value of the first element in the array.
    
    SELECT c1[1]
      FROM view001;
  18. Create a table T2 with data types that are different from those in the source table T1.
    
    CREATE HADOOP TABLE t1 (
      c1 INT,
      c2 INT);
    
    CREATE HADOOP TABLE t2 (
      c1, c2) AS (
      SELECT DOUBLE(c1), DOUBLE(c2)
        FROM t1);
  19. Create a Hadoop MQT.
    
    CREATE HADOOP TABLE hadoopmqt AS (
      SELECT c1, c2
        FROM t1)
      DATA INITIALLY DEFERRED
      REFRESH DEFERRED
      MAINTAINED BY USER
      DISABLE QUERY OPTIMIZATION;
  20. Create a partitioned Hadoop MQT.
    
    CREATE HADOOP TABLE hadooppartitionedmqt
      PARTITIONED BY (c2) AS (
      SELECT c1, c2
        FROM t1)
      DATA INITIALLY DEFERRED
      REFRESH DEFERRED
      MAINTAINED BY USER
      DISABLE QUERY OPTIMIZATION;