Data types that are supported by Db2 Big SQL for Hadoop and HBase tables

Db2® Big SQL supports a number of data types for Hadoop and HBase tables.

Table 1. Data types for Hadoop and HBase tables

Data type in CREATE TABLE (HADOOP/HBASE) | Db2 Big SQL data type | Notes
ARRAY | ARRAY | For more information, see ARRAY.
BIGINT | 64-bit signed integer | INT8 is a synonym for BIGINT.
BINARY | BINARY | BINARY is a synonym for VARBINARY and is treated as though VARBINARY were specified. If a length is not specified, 32,672 bytes is assumed. It is recommended that an explicit length be specified.
BOOLEAN | BOOLEAN | For more information, see Boolean values.
CHAR(n) | Up to 255 bytes |
DATE | DATE | For more information, see DATE.
DECIMAL(p,s) | DECIMAL | For more information, see DECIMAL.
DOUBLE | 64-bit IEEE float |
FLOAT | 64-bit IEEE float | FLOAT8 is a synonym for FLOAT. Change FLOAT to REAL in DDL to keep Db2 Big SQL 1.0 behavior. For more information, see FLOAT and REAL.
INTEGER | 32-bit signed integer | INT4 is a synonym for INTEGER.
REAL | 32-bit IEEE float | FLOAT4 is a synonym for REAL. For more information, see FLOAT and REAL.
SMALLINT | 16-bit signed integer | INT2 is a synonym for SMALLINT.
STRING | 32,672 bytes | Use of the STRING type in Db2 Big SQL is discouraged because of the resulting increase in storage requirements and the potential for reduced performance. See Data types and performance. See also STRING.
STRUCT | STRUCT | For more information, see ROW.
TIMESTAMP | Date and time with 9 digits of fractional seconds | DATETIME is a synonym for TIMESTAMP.
TINYINT | 16-bit signed integer | For more information, see TINYINT.
[VAR]BINARY(n) | [VAR]BINARY(n) | BINARY is a synonym for VARBINARY and is treated as though VARBINARY were specified. If a length is not specified, 32,672 bytes is assumed. It is recommended that an explicit length be specified.
VARCHAR(n) | Up to 32,672 bytes |

To learn about ways in which applications can use Db2 Big SQL data types, see Understanding data types.

ARRAY

The ARRAY type can be an ordinary ARRAY or an associative ARRAY. You can define an ordinary ARRAY by using the Hive ARRAY syntax, or an associative ARRAY by using the Hive MAP syntax, but the SQL-standard syntax is recommended.
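
The following sketch shows both styles side by side; the table and column names are hypothetical, and the exact syntax should be verified against the CREATE TABLE (HADOOP) statement reference:

```sql
-- SQL-standard syntax (recommended):
CREATE HADOOP TABLE t_standard (
  id     INT,
  scores INT ARRAY[10],                    -- ordinary ARRAY, indexed by position
  prices DECIMAL(9,2) ARRAY[VARCHAR(20)]   -- associative ARRAY, indexed by a key
);

-- Equivalent Hive syntax, also accepted:
CREATE HADOOP TABLE t_hive (
  id     INT,
  scores ARRAY<INT>,                       -- ordinary ARRAY
  prices MAP<VARCHAR(20), DECIMAL(9,2)>    -- associative ARRAY (MAP)
);
```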

DATE

When a Db2 Big SQL HBase or Hadoop table is created with a column of type DATE, this column is mapped to the Hive DATE type.

A date is a three-part value (year, month, and day):
  • The range of the month part is 1 - 12.
  • The range of the day part is 1 - x, where x is 28, 29, 30, or 31, and depends on the month.
  • The range of the year part is 0001 to 9999.

If you are using the ORC file format with historical DATE values (values prior to 1582-10-15), and the ORC files were inserted with Db2 Big SQL 6.0 or Hive on HDP 3.0, use tblproperties('orc.proleptic.gregorian.default' = 'true') in the CREATE TABLE (HADOOP) statement or the ALTER TABLE (HADOOP/HBASE) statement. If the ORC files were inserted with Db2 Big SQL 5.0.4 (or earlier) or Hive on HDP 2.6.5 (or earlier), use tblproperties('orc.proleptic.gregorian.default' = 'false') or leave it unset. False is the default setting for orc.proleptic.gregorian.default.
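
For example, a table over ORC files written by Db2 Big SQL 6.0 or Hive on HDP 3.0 might set the property as follows (the table and column names are hypothetical):

```sql
-- ORC files in this table were written by Db2 Big SQL 6.0 or Hive on HDP 3.0,
-- so historical dates are interpreted on the proleptic Gregorian calendar:
CREATE HADOOP TABLE hist_dates (
  id         INT,
  event_date DATE
)
STORED AS ORC
TBLPROPERTIES ('orc.proleptic.gregorian.default' = 'true');
```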

DECIMAL

The DECIMAL data type is a numeric data type with fixed scale and precision. The precision represents the total number of digits that can be represented by the column. Therefore, +12.34 and -1.234 are both defined with a precision of 4. The scale is the number of fractional digits, and must be less than or equal to the precision.

The maximum precision value is 31. Therefore, the largest integral value is represented by DECIMAL(31,0) and contains the value 9 repeated 31 times. The most precise fractional value between 0 and 1 or between 0 and -1 is represented by DECIMAL(31,31), with 31 digits to the right of the decimal point. The closest value to 0 has 30 zeros and a 1 following the decimal point. The closest value to 1 has 31 nines following the decimal point.

Important: Hive supports a DECIMAL data type with a maximum precision value of 38. Db2 Big SQL tolerates a DECIMAL(38,0) data type in CREATE TABLE and ALTER TABLE statements for both Hadoop and HBase tables, and maps it to a DECIMAL(31,0) data type. This toleration support is limited to DECIMAL(38,0) only. Db2 Big SQL does not handle data that exceeds a precision value of 31. When the definition of a Hive table is imported into Db2 Big SQL (by using the HCAT_SYNC_OBJECTS stored procedure or automatic metastore synching), any column that is defined as DECIMAL(38,0) is changed to DECIMAL(31,0). If any data in the table exceeds a precision of 31, a null value is returned.
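
A minimal sketch of this toleration behavior, with a hypothetical table name:

```sql
-- DECIMAL(38,0) is tolerated in the DDL but is mapped to DECIMAL(31,0):
CREATE HADOOP TABLE big_dec (
  id  INT,
  amt DECIMAL(38,0)   -- stored and described as DECIMAL(31,0)
);
-- Toleration is limited to DECIMAL(38,0); other precisions above 31,
-- such as DECIMAL(38,2), are not supported.
```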

Db2 Big SQL converts between DECIMAL and other numeric types. For example, a DECIMAL value with zero scale is converted to the smallest appropriate integer type.

FLOAT and REAL

Db2 Big SQL defines the REAL data type as a 32-bit floating point value, and FLOAT as a synonym for DOUBLE (a 64-bit floating point value).

In Hive, FLOAT always refers to a 32-bit floating point value, which corresponds to the Java float data type. The Db2 Big SQL REAL data type is therefore the equivalent of the Hive FLOAT data type.

To migrate Hive and Db2 Big SQL applications, keep the following points in mind:
  • In CREATE HADOOP TABLE statements, change all columns defined as FLOAT to REAL to retain the same storage characteristics.
  • When you access Hive or Db2 Big SQL tables, FLOAT values are read as DOUBLE values, which can produce minor rounding differences.

You might see precision errors when you perform operations that combine a DOUBLE column with DECIMAL literals in a table defined with ROW FORMAT DELIMITED. As a best practice, use literals in E-notation for the DOUBLE column (for example, +1.14000000000000E+000). You can also use the ROUND function to work around some precision errors.
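
The workarounds described above might look like the following sketch, where t is a hypothetical ROW FORMAT DELIMITED table with a DOUBLE column d:

```sql
-- Comparing a DOUBLE column against a DECIMAL literal directly can miss rows
-- because of binary floating-point representation error:
SELECT * FROM t WHERE d = 1.14;            -- might not match as expected

-- Workaround 1: use an E-notation (DOUBLE) literal:
SELECT * FROM t WHERE d = +1.14000000000000E+000;

-- Workaround 2: round to the intended scale before comparing:
SELECT * FROM t WHERE ROUND(d, 2) = 1.14;
```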

ROW

The ROW type contains field definitions that contain a field name and a data type. You can use the STRUCT syntax that Hive uses to define a ROW data type, but the SQL standard is recommended.
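
A sketch of both styles follows; the table, column, and field names are hypothetical, and the exact syntax should be verified against the CREATE TABLE (HADOOP) statement reference:

```sql
-- SQL-standard ROW syntax (recommended):
CREATE HADOOP TABLE person (
  id   INT,
  addr ROW(street VARCHAR(50), city VARCHAR(30), zip INT)
);

-- Equivalent Hive STRUCT syntax, also accepted:
CREATE HADOOP TABLE person_hive (
  id   INT,
  addr STRUCT<street: VARCHAR(50), city: VARCHAR(30), zip: INT>
);
```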

STRING

The STRING data type is mapped directly to Hive STRING. The native Db2 Big SQL data type is VARCHAR(max). This has several implications:
  • If a value in the underlying storage exceeds the maximum VARCHAR length, that value is silently truncated when converted to the Db2 Big SQL native data type.
  • If the underlying data does not require the maximum VARCHAR length for storage (for example, if the column never exceeds 100 characters), then Db2 Big SQL allocates unnecessary resources for the handling of that column. To improve performance, use a VARCHAR(n) with a defined size instead of a STRING. Alter the table to define an appropriate length for the column by using the following syntax:
    ALTER TABLE <schema>.<table> ALTER COLUMN <col> SET DATA TYPE VARCHAR(<size>)
    This alter operation changes the definition of the table both in Db2 Big SQL and in Hive. If data in the VARCHAR(n) column exceeds <size>, the data is truncated when queried.
The default behavior for the STRING data type is to map the type to the SQL data type VARCHAR(32672). As noted above, this default can lead to performance issues. You can use the bigsql.string.size configuration property to change the size of the VARCHAR that is used to represent the STRING type. For example, set the following property before you create a Hadoop table that contains a STRING column:
SET HADOOP PROPERTY 'bigsql.string.size'=4096;
The result of this property change is that all STRING types will be represented by the SQL data type VARCHAR(4096).

TIMESTAMP

The TIMESTAMP data type is a six- or seven-part value (year, month, day, hour, minute, second and, optionally, fractional seconds). See the DATE data type for more information about the date portion of the timestamp (year, month, day). The TIME portion is a three-part value (hour, minute, and second) designating a time of day under a 24-hour clock:
  • The range of the hour part is 0 - 23.
  • The range of the minute and second parts is 0 - 59.

The number of digits in the fractional seconds part is specified by using an attribute whose value is between 0 and 9 (default 9).

Timestamps are considered to be time zone-less. For convenience, functions are provided for converting values to and from specific time zones (TO_UTC_TIMESTAMP, FROM_UTC_TIMESTAMP).
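
For example, the conversion functions can be used as follows (the time zone names and literal values are illustrative):

```sql
-- Interpret a stored (time zone-less) timestamp as UTC and render it in a
-- named time zone; 2021-03-15 falls in Eastern Daylight Time (UTC-4):
SELECT FROM_UTC_TIMESTAMP(TIMESTAMP '2021-03-15 12:00:00', 'America/New_York')
FROM sysibm.sysdummy1;   -- 12:00 UTC shown as 08:00 Eastern

-- The reverse: treat a local value as Eastern time and convert it to UTC:
SELECT TO_UTC_TIMESTAMP(TIMESTAMP '2021-03-15 08:00:00', 'America/New_York')
FROM sysibm.sysdummy1;   -- 08:00 Eastern converted to 12:00 UTC
```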

If you are using the ORC file format with historical TIMESTAMP values (values prior to 1582-10-15), and the ORC files were inserted with Db2 Big SQL 6.0 or Hive on HDP 3.0, use tblproperties('orc.proleptic.gregorian.default' = 'true') in the CREATE TABLE (HADOOP) statement or the ALTER TABLE (HADOOP/HBASE) statement. If the ORC files were inserted with Db2 Big SQL 5.0.4 (or earlier) or Hive on HDP 2.6.5 (or earlier), use tblproperties('orc.proleptic.gregorian.default' = 'false') or leave it unset. False is the default setting for orc.proleptic.gregorian.default.

TINYINT

The TINYINT type in the context of the CREATE TABLE (HADOOP) statement is supported only to maintain compatibility with Hive. It is unavailable in other contexts. TINYINT is mapped to the Db2 SMALLINT type, but the values are constrained (-128 to 127) to match the Hive definition of a TINYINT.
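
A minimal sketch of this mapping, with a hypothetical table name:

```sql
-- TINYINT is accepted only in CREATE TABLE (HADOOP) for Hive compatibility.
CREATE HADOOP TABLE flags (
  id   INT,
  code TINYINT   -- stored as SMALLINT, but values are constrained to -128..127
);
```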

Data types and performance

The Db2 Big SQL engine works most efficiently when the table definition enables a row to fit within 32 KB of memory. When the calculated row size exceeds 32 KB, some queries might see a performance degradation. The STRING data type should be avoided because it maps internally to VARCHAR(32672), which means that the table rows will usually exceed 32 KB in size. You can avoid this performance degradation by changing any STRING data type references to explicit VARCHAR(n) types that best match the data size.

If the maximum size of a STRING or VARCHAR column is known, the optimal way to handle this column is to explicitly alter the table to define an appropriate length for the column by using the following syntax:
ALTER TABLE <schema>.<table> ALTER COLUMN <col> SET DATA TYPE VARCHAR(<size>)
This alter operation changes the definition of the table both in Db2 Big SQL and in Hive. If data in the VARCHAR(n) column exceeds <size>, the data is truncated when queried.