Identifier handling for Datalake tables

The identifier handling of Datalake tables is more restrictive than regular Db2 identifiers since these tables are maintained in both the Db2 catalog tables and the Hive Metastore (HMS). As a result, the identifier semantics are limited to a subset of what is supported in both Db2 and Hive.

For more information about the handling of Db2 identifiers, see Identifiers.

The most obvious difference in identifier handling between Hive and Db2 is that even delimited identifiers in Hive are case-insensitive and they are treated as if written in lower-case. For instance, the typical Db2 table names `AaAa` and `aAaA` are both identified as the same Datalake table named `aaaa`. This means that there might be occasions where a SQLSTATE 42710 error occurs when creating Datalake tables due to case sensitivity either in the table name or the schema name. Consider the following example that creates the MYTABLE table under the MySchema schema:

create datalake table "MySchema"."MYTABLE" (c1 int) location 'db2remote://odfdefault//default/MYTAB'

resulting in the following return:

DB20000I The SQL command completed successfully.

Now, let's run the CREATE Datalake TABLE statement again to create another table (TABLEXX), but change the case of the schema name to Myschema as in the following example:

create datalake table "Myschema"."TABLEXX" (c1 int) location 'db2remote://odfdefault//default/MYTAB'

resulting in the following return:


DB21034E The command was processed as an SQL statement because it was not a
valid Command Line Processor command. During SQL processing it returned:
SQL0601N The name of the object to be created is identical to the existing
name "myschema" of type "SCHEMA". SQLSTATE=42710

The second run of the CREATE Datalake TABLE statement failed since the schema name differs only in case from the first run of the CREATE Datalake TABLE statement.

The following SELECT statement example shows you the schema and table names that are already in use as considered by Db2 and Hive:

select varchar(tabschema,25) as db2_schema, varchar(hiveschema,25) as hive_schema, varchar(tabname,25) as db2_tabname, varchar(hivetab,25) as hive_tabname from syshadoop.hcat_tables where upper(tabschema) = upper('myschema') and upper(tabname) = upper('mytab')

resulting in the following return:


DB2_SCHEMA                HIVE_SCHEMA               DB2_TABNAME               HIVE_TABNAME
------------------------- ------------------------- ------------------------- -------------------------
MySchema                  myschema                  MyTab                     mytab

1 record(s) selected.

From the SELECT output, we know that the only 'case' version of "myschema" that we can use is "MySchema".

A regular identifier in Hive is a sequence of one or more alphanumeric characters (A-Z, 0-9) and underscores which must not start with an underscore. A delimited identifier in Hive is a sequence of one or more UTF-8 characters enclosed by backticks. However, dependent on the context in which the identifier is used, only a subset of the UTF-8 character set is accepted. Leading and trailing spaces in the sequence are treated differently dependent on the context of the identifier. For example, in schema and table names, leading and trailing spaces are removed, but in column names the spaces are kept.

Schema and table names can contain the following supported characters: A-Z, a-z, 0-9, '"', '%', '&','\'', '(', ')', '*', '+', ',', '-', '/', ':', ';', '<', '=', '>', '?', '[', ']','_', '|', '{', '}', '$', '^', '!', '~', '#', '@', '', or ' '. If a schema name or table name contains an unsupported character, an error occurs. In addition, an error occurs if a schema and table name contains an embedded '.', or if a schema name starts with '@', or if a schema name ends with '#' or '!'.

In general, column names may contain any UTF-8 character. However, similar to schema and table names there are a few unsupported characters. Characters '.' and ':' are unsupported. Depending on the used storage handler and storage format, additional limitations exist:

Due to limitations in Hive, many special characters including blanks are not supported for column names for Hive Datalake table that uses parquet file format.
Only C identifiers (has to begin with a letter, then letters, digits, or underscores) can be used for column names for Hive Datalake table using Avro file format.

If you use an unsupported identifier, an SQL0113N error is returned.