Data types that are supported by Big SQL for Hadoop and HBase tables

Big SQL supports a number of data types for Hadoop and HBase tables.

Table 1. Data types for Hadoop and HBase tables

Data type supported in CREATE TABLE (HADOOP/HBASE) | Big SQL data type | Notes
ARRAY | ARRAY | For more information, see ARRAY.
BIGINT | 64-bit signed integer | INT8 is a synonym for BIGINT.
BINARY | BINARY | BINARY is a synonym for VARBINARY; see the [VAR]BINARY(n) row.
BOOLEAN | Treated as SMALLINT | For more information and examples, see BOOLEAN.
CHAR(n) | Up to 32,672 characters |
DATE | Date | For more information, see DATE.
DECIMAL(p,s) | DECIMAL | For more information, see DECIMAL.
DOUBLE | 64-bit IEEE float |
FLOAT | 64-bit IEEE float | FLOAT8 is a synonym for FLOAT. Change FLOAT to REAL in your DDL to keep Big SQL 1.0 behavior. For more information, see FLOAT and REAL.
INTEGER | 32-bit signed integer | INT4 is a synonym for INTEGER.
REAL | 32-bit IEEE float | FLOAT4 is a synonym for REAL. For more information, see FLOAT and REAL.
SMALLINT | 16-bit signed integer | INT2 is a synonym for SMALLINT.
STRING | 32,672 characters | See Data types and performance. See also STRING.
STRUCT | ROW | For more information, see ROW.
TIMESTAMP | Date and time with up to 9 digits of fractional seconds | For more information, see TIMESTAMP.
TINYINT | Treated as SMALLINT | Values are constrained to -128 to 127. For more information, see TINYINT.
[VAR]BINARY(n) | [VAR]BINARY(n) | BINARY is a synonym for VARBINARY. If BINARY is used, it is treated as though VARBINARY were specified. If no length is specified, a length of 32,672 is assumed; specifying an explicit length is recommended. With this data type, the Java™ I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Big SQL.
VARCHAR(n) | Up to 32,672 characters |

To learn about ways in which applications can use Big SQL data types, see Understanding data types.

ARRAY

The ARRAY type can be an ordinary ARRAY or an associative ARRAY. You can use the ARRAY syntax that Hive uses to define an ordinary ARRAY data type, or the MAP syntax that Hive uses to define an associative ARRAY data type, but the SQL-standard syntax is recommended, as shown in the following sketch.
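
The following sketch shows both forms. The table and column names are hypothetical, and the SQL-standard ARRAY[...] notation is shown under the assumption that your Big SQL level accepts it in CREATE TABLE (HADOOP):

-- Hive syntax: ARRAY<...> for an ordinary array, MAP<...> for an associative array
CREATE HADOOP TABLE hive_style (
    id INT,
    tags ARRAY<VARCHAR(20)>,
    scores MAP<VARCHAR(20), INT>
);

-- SQL-standard syntax (recommended)
CREATE HADOOP TABLE sql_style (
    id INT,
    tags VARCHAR(20) ARRAY[10],       -- ordinary array with a maximum cardinality
    scores INT ARRAY[VARCHAR(20)]     -- associative array indexed by VARCHAR keys
);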

BOOLEAN

The BOOLEAN data type is mapped to the Hive BOOLEAN data type. However, the native SQL representation is a SMALLINT, where the value 1 represents true and 0 represents false.

To be compatible with Hive and Big SQL 1.0, the BOOLEAN type is supported in the CREATE TABLE (HADOOP) statement.

Storage of a BOOLEAN column is compatible with Hive and Big SQL. The value of a BOOLEAN type is physically represented as a SMALLINT that contains the value of 1 for true and 0 for false. When you query a BOOLEAN column, change the query from a Big SQL 1.0 query:
SELECT COUNT(*) FROM T1 WHERE BOOL_COL = true;
to a Big SQL query:
SELECT COUNT(*) FROM T1 WHERE BOOL_COL = 1;
Assume the following table definition:

CREATE EXTERNAL HADOOP TABLE booltab (c1 int, c2 boolean)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION '/path/to/booltab.csv'
;
The file in '/path/to/booltab.csv' contains the following data:

1,true
2,false
3,false   
4,true
5,fred
The values that are contained in the table are all valid Hive representations of a BOOLEAN, with the exception of row 5 (5,fred). When you query the table, the values are represented as SMALLINT values, with the exception of row 5, which appears as a NULL value:

SELECT * FROM booltab;
+---+------+
| 1 | 1    |
+---+------+
| 2 | 0    |
+---+------+ 
| 3 | 0    |
+---+------+
| 4 | 1    |
+---+------+
| 5 | NULL |
+---+------+
When you query the table, the value of the BOOLEAN column must always be treated as a SMALLINT. For example, the following query results in an error:
SELECT * FROM booltab WHERE c2 = true
 [State: 42818][Code: -401]: The data types of the operands for the
    operation "=" are not compatible or comparable
Change the query to the following statement to get valid output:

SELECT * FROM booltab WHERE c2 = 1;

+---+---+
| 1 | 1 |
+---+---+
| 4 | 1 |
+---+---+
In an INSERT statement, the same rules apply. Therefore, the following INSERT statement results in an error:

INSERT INTO booltab VALUES (6, false);
 [State: 42821][Code: -408]: A value is not compatible with the data type of
    its assignment target. Target name is "C2".
Change the INSERT statement to the following:
INSERT INTO booltab VALUES (6, 0);
This INSERT statement adds a new "false" value to the table. However, if you view the stored representation of the data that the operation produced, it appears as the following data:
6,false
Remember that this INSERT stores the data in a table that was defined with the Hive ROW FORMAT DELIMITED clause, so true and false values are stored as the strings "true" and "false". The storage type that you define when you create the table determines both the SQL type and the way the data is represented.
The following examples show the difference between using the BOOLEAN data type in Big SQL and Big SQL 1.0:
SELECT * FROM tableBIGSQL WHERE bool_col = 1;
SELECT * FROM tableBIGSQL1.0 WHERE bool_col = true;

DATE

When a Big SQL HBase or Hadoop table is created with a column of type DATE or type DATE STORED AS TIMESTAMP, this column is mapped to the Hive TIMESTAMP type. If a column is created with type DATE STORED AS DATE, this column is mapped to the Hive DATE type. If a table is created in Hive with a column of type DATE, Big SQL treats this column as a DATE STORED AS DATE column. This means that the Hive DATE column is mapped to the Big SQL DATE type by default.

When Hive and the Java I/O engine write historical date and timestamp values (values before the year 1583), the values are converted from the Gregorian calendar to the Julian calendar. Because the C++ I/O engine does not do this conversion, historical date and timestamp values in your tables can differ by up to ten days, depending on how the data is inserted and how it is read. The ORC file format, which is the only file format that uses the Java I/O engine for both read and write operations, is recommended when you are working with historical dates.

If you use DATE STORED AS DATE in a column specification, the Java I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see Big SQL readers and writers and File formats supported by Big SQL.
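
For example, the following hypothetical table contrasts the two mappings and uses the recommended ORC format:

CREATE HADOOP TABLE dates_demo (
    id INT,
    d_ts DATE,                   -- mapped to the Hive TIMESTAMP type (the default)
    d_date DATE STORED AS DATE   -- mapped to the Hive DATE type; uses the Java I/O interface
)
STORED AS ORC;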

The range of the year part is 0001 to 9999.

DECIMAL

The DECIMAL data type is a numeric data type with fixed scale and precision. It is used in CREATE HADOOP TABLE and ALTER HADOOP TABLE statements. The precision represents the total number of digits that can be represented by the column. Therefore, +12.34 and -1.234 are both defined with a precision of 4. The scale is the number of fractional digits, and must be less than or equal to the precision.

The maximum precision value is 31. Therefore, the largest integral value is represented by DECIMAL(31,0) (999... with 9 repeated 31 times). The most precise fractional value (between 0 and 1, or 0 and -1) is represented by DECIMAL(31,31), with 31 digits to the right of the decimal point. The value closest to 0 would be .0000...1 (30 zeros and the final 1). The value closest to 1 would be .999... (9 repeated 31 times).

Big SQL converts between DECIMAL and other numeric types. For example, a DECIMAL with a zero scale is converted to the smallest appropriate integer type.
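
For example, with a hypothetical table:

CREATE HADOOP TABLE prices (
    item_id INT,
    price DECIMAL(10,2)   -- at most 10 digits in total, 2 of them fractional
);

INSERT INTO prices VALUES (1, 19.99);   -- valid
-- 123456789.99 would overflow DECIMAL(10,2), which allows only 8 integral digits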

FLOAT and REAL

Big SQL defines the REAL data type as a 32-bit floating point value, and defines the FLOAT data type as a synonym for DOUBLE (a 64-bit floating point value).

In Hive, a FLOAT always refers to a 32-bit floating point value, which corresponds to the Java FLOAT data type. In Big SQL, the REAL data type is a synonym for the Hive FLOAT data type.

To migrate Hive and Big SQL 1.0 applications, note the following points:
  • In CREATE HADOOP TABLE statements, change all columns that are defined as FLOAT to REAL to retain the same storage characteristics.
  • When you access Hive or Big SQL tables, FLOAT values are read as DOUBLE values, which can produce minor rounding differences.

You can see precision errors when you perform operations on a DOUBLE column and DECIMAL literals in a table that is defined with ROW FORMAT DELIMITED. As a best practice, use E-notation values for DOUBLE columns (for example, +1.14000000000000E+000). You can use the ROUND function to work around some precision errors.
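
The following migration sketch uses a hypothetical table:

-- Big SQL 1.0 DDL, where FLOAT was a 32-bit value:
--     CREATE HADOOP TABLE readings (sensor INT, temp FLOAT);
-- Equivalent Big SQL DDL that retains the 32-bit storage:
CREATE HADOOP TABLE readings (sensor INT, temp REAL);

-- Use ROUND to work around rounding artifacts in DOUBLE arithmetic:
SELECT sensor, ROUND(temp * 1.8 + 32, 2) AS temp_f FROM readings;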

ROW

The ROW type contains field definitions, each of which consists of a field name and a data type. You can use the STRUCT syntax that Hive uses to define a ROW data type, but the SQL-standard syntax is recommended, as shown in the following sketch.
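
The following sketch shows both forms with hypothetical table and field names:

-- Hive STRUCT syntax
CREATE HADOOP TABLE hive_rows (
    id INT,
    person STRUCT<name:VARCHAR(30), age:INT>
);

-- SQL-standard ROW syntax (recommended)
CREATE HADOOP TABLE sql_rows (
    id INT,
    person ROW(name VARCHAR(30), age INT)
);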

STRING

The STRING data type is mapped directly to Hive STRING. The native Big SQL data type is VARCHAR(max). This has several implications:
  • If a value in the underlying storage exceeds the maximum VARCHAR length, that value is silently truncated when converted to the Big SQL native data type.
  • If the underlying data does not require the maximum VARCHAR length for storage (for example, if the column never exceeds 100 characters), then Big SQL allocates unnecessary resources for the handling of that column. You should use an explicit VARCHAR instead of a STRING to improve performance.
The default behavior for the STRING data type is to map the type to the SQL data type VARCHAR(32672). As noted above, this default behavior can lead to performance issues. You can use the bigsql.string.size configuration property to change the size of the VARCHAR that is used to represent the STRING type. For example, set the following property before you create a Hadoop table that contains a STRING data type:
SET HADOOP PROPERTY 'bigsql.string.size'=4096;
The result of this property change is that the STRING type will be represented with the SQL data type of VARCHAR(4096).
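
For example, with a hypothetical table:

SET HADOOP PROPERTY 'bigsql.string.size'=4096;
CREATE HADOOP TABLE notes (id INT, body STRING);   -- body is represented as VARCHAR(4096)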

TIMESTAMP

The TIMESTAMP data type is a six- or seven-part value (year, month, day, hour, minute, second and, optionally, fractional seconds). The number of digits in the fractional seconds part is specified by using an attribute whose value is between 0 and 9 (default 9). Timestamps are considered to be time zone-less. For convenience, functions are provided for converting values to and from specific time zones (TO_UTC_TIMESTAMP, FROM_UTC_TIMESTAMP).
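
For example, with a hypothetical table and time zone:

CREATE HADOOP TABLE events (
    id INT,
    ts TIMESTAMP(6)   -- 6 digits of fractional seconds; the default is 9
);

-- Interpret a stored value as UTC and render it in a specific time zone:
SELECT FROM_UTC_TIMESTAMP(ts, 'America/New_York') FROM events;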

TINYINT

The TINYINT type in the context of the CREATE TABLE (HADOOP) statement is supported only to maintain compatibility with Hive and Big SQL 1.0. Values are constrained (-128 to 127) to match the Hive definition of a TINYINT.
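
For example, with a hypothetical table:

CREATE HADOOP TABLE tiny_demo (c1 TINYINT);
INSERT INTO tiny_demo VALUES (127);    -- accepted: within the -128 to 127 range
-- INSERT INTO tiny_demo VALUES (200); -- rejected: outside the TINYINT range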

Data types and performance

The Big SQL engine works most efficiently when the table definition enables a row to fit within 32 KB of memory. When the calculated row size exceeds 32 KB, some queries might see performance degradation. Avoid the STRING data type: because it maps internally to VARCHAR(32672), table rows that contain STRING columns usually exceed 32 KB in size. You can avoid this performance degradation by changing STRING references to explicit VARCHAR(n) types that best match the data size.
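
For example, replace STRING columns with sized VARCHAR columns (hypothetical table):

-- Instead of:
--     CREATE HADOOP TABLE customers (id INT, name STRING, city STRING);
-- size each column to the data:
CREATE HADOOP TABLE customers (
    id INT,
    name VARCHAR(100),
    city VARCHAR(50)
);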