Data types that are supported by Big SQL for Hadoop and HBase tables
Big SQL supports a number of data types for Hadoop and HBase tables.
Data type supported in CREATE TABLE (HADOOP/HBASE) | Big SQL data type | Notes |
---|---|---|
ARRAY | ARRAY | For more information, see ARRAY. |
BIGINT | 64-bit signed integer | INT8 is a synonym for BIGINT. |
BINARY | BINARY | BINARY is a synonym for VARBINARY. If BINARY is used, it will be treated as though VARBINARY were specified. If a length is not specified, 32,672 characters is assumed. It is recommended that an explicit length be specified. With this data type, the Java™ I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Big SQL. |
BOOLEAN | Treated as SMALLINT | For more information and examples, see BOOLEAN. |
CHAR(n) | Up to 32,672 characters | |
DATE | Date | For more information, see DATE. |
DECIMAL(p,s) | DECIMAL | For more information, see DECIMAL. |
DOUBLE | 64-bit IEEE float | |
FLOAT | 64-bit IEEE float | FLOAT8 is a synonym for FLOAT. Change to REAL in DDL to keep Big SQL 1.0 behavior. For more information, see FLOAT and REAL. |
INTEGER | 32-bit signed integer | INT4 is a synonym for INTEGER. |
REAL | 32-bit IEEE float | FLOAT4 is a synonym for REAL. For more information, see FLOAT and REAL. |
SMALLINT | 16-bit signed integer | INT2 is a synonym for SMALLINT. |
STRING | 32,672 characters | See Data types and performance. See also STRING. |
STRUCT | STRUCT | For more information, see ROW. |
TIMESTAMP | Date and time with up to 9 digits of fractional time | For more information, see TIMESTAMP. |
TINYINT | 8-bit signed integer | For more information, see TINYINT. |
[VAR]BINARY(n) | [VAR]BINARY(n) | BINARY is a synonym for VARBINARY. If BINARY is used, it will be treated as though VARBINARY were specified. If a length is not specified, 32,672 characters is assumed. It is recommended that an explicit length be specified. With this data type, the Java I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Big SQL. |
VARCHAR(n) | Up to 32,672 characters | |
To learn about ways in which applications can use Big SQL data types, see Understanding data types.
ARRAY
The ARRAY type can be an ordinary ARRAY or an associative ARRAY. You can use the ARRAY syntax that Hive uses to define an ordinary ARRAY data type, or the MAP syntax that Hive uses to define an associative ARRAY data type, but the SQL standard syntax is recommended.
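As a sketch (the table and column names here are hypothetical), the two kinds of ARRAY might be declared as follows, with the equivalent Hive syntax noted in comments:

```sql
-- Hypothetical tables; the SQL standard syntax is recommended.
-- Ordinary ARRAY (Hive equivalent: phones ARRAY<VARCHAR(20)>):
CREATE HADOOP TABLE contact (
  id     INT,
  phones VARCHAR(20) ARRAY[5]
);

-- Associative ARRAY (Hive equivalent: scores MAP<VARCHAR(40), INT>):
CREATE HADOOP TABLE report (
  id     INT,
  scores INT ARRAY[VARCHAR(40)]
);
```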
BOOLEAN
The BOOLEAN data type is mapped to the Hive BOOLEAN data type. However, the native SQL representation is a SMALLINT, where the value 1 represents true and 0 represents false.
To maintain compatibility with Hive, the BOOLEAN type is supported in the CREATE TABLE (HADOOP) statement. However, because the native representation is a SMALLINT, change a Hive query such as:
SELECT COUNT(*) FROM T1 WHERE BOOL_COL = true;
to a Big SQL query:
SELECT COUNT(*) FROM T1 WHERE BOOL_COL = 1;
For example, consider the following table, which is stored as delimited text:
CREATE EXTERNAL HADOOP TABLE booltab (c1 INT, c2 BOOLEAN)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LOCATION '/path/to/booltab.csv';
The file in '/path/to/booltab.csv' contains the following data:
1,true
2,false
3,false
4,true
5,fred
The values that are contained in the table are all valid Hive representations of a BOOLEAN, with the exception of row 5 (5,fred). When you query the table, the values are represented as SMALLINT values, with the exception of row 5, which appears as a NULL value:
SELECT * FROM booltab;
+----+------+
| C1 | C2   |
+----+------+
| 1  | 1    |
| 2  | 0    |
| 3  | 0    |
| 4  | 1    |
| 5  | NULL |
+----+------+
When you query the table, the value of the BOOLEAN column must always be treated as a SMALLINT. For example, the following query results in an error:
SELECT * FROM booltab WHERE c2 = true;
[State: 42818][Code: -401]: The data types of the operands for the operation "=" are not compatible or comparable
Change the query to the following statement to get valid output:
SELECT * FROM booltab WHERE c2 = 1;
+----+----+
| C1 | C2 |
+----+----+
| 1  | 1  |
| 4  | 1  |
+----+----+
In an INSERT statement, the same rules apply. Therefore, the following INSERT statement results in an error:
INSERT INTO booltab VALUES (6, false);
[State: 42821][Code: -408]: A value is not compatible with the data type of its assignment target. Target name is "C2".
Change the INSERT statement to the following statement:
INSERT INTO booltab VALUES (6, 0);
This INSERT statement adds a new "false" value to the table. If you view the stored representation of the data that was produced by the operation, it appears as the following data:
6,false
Remember that this INSERT stores the data into a table that was defined with the Hive ROW FORMAT DELIMITED clause, for which the stored representations of the true and false values are the strings "true" and "false". The storage type that you define when you create the table determines both the SQL type and the way that queries are written:
SELECT * FROM tableBIGSQL WHERE bool_col = 1;
SELECT * FROM tableBIGSQL1.0 WHERE bool_col = true;
DATE
When a Big SQL HBase or Hadoop table is created with a column of type DATE or type DATE STORED AS TIMESTAMP, this column is mapped to the Hive TIMESTAMP type. If a column is created with type DATE STORED AS DATE, this column is mapped to the Hive DATE type. If a table is created in Hive with a column of type DATE, Big SQL treats this column as a DATE STORED AS DATE column. This means that the Hive DATE column is mapped to the Big SQL DATE type by default.
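A minimal sketch of these mappings (the table and column names are hypothetical):

```sql
-- Hypothetical table illustrating the DATE mappings described above.
CREATE HADOOP TABLE events (
  id        INT,
  created   DATE,                 -- mapped to the Hive TIMESTAMP type
  event_day DATE STORED AS DATE   -- mapped to the Hive DATE type; uses the Java I/O interface
);
```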
When Hive and the Java I/O engine write historical date and timestamp values (values before the year 1583), the values are converted from the Gregorian calendar to the Julian calendar. Because the C++ I/O engine does not do this conversion, historical date and timestamp values in your tables can differ by up to ten days, depending on how the data is inserted and how it is read. The ORC file format, which is the only file format that uses the Java I/O engine for both read and write operations, is recommended when you are working with historical dates.
If you use DATE STORED AS DATE in a column specification, the Java I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see Big SQL readers and writers and File formats supported by Big SQL.
The range of the year part is 0001 to 9999.
DECIMAL
The DECIMAL data type is a numeric data type with fixed scale and precision. It is used in CREATE HADOOP TABLE and ALTER HADOOP TABLE statements. The precision represents the total number of digits that can be represented by the column. Therefore, +12.34 and -1.234 are both defined with a precision of 4. The scale is the number of fractional digits, and must be less than or equal to the precision.
The maximum precision value is 31. Therefore, the largest integral value is represented by DECIMAL(31,0) (999... with 9 repeated 31 times). The most precise fractional value (between 0 and 1, or 0 and -1) is represented by DECIMAL(31,31), with 31 digits to the right of the decimal point. The value closest to 0 would be .0000...1 (30 zeros and the final 1). The value closest to 1 would be .999... (9 repeated 31 times).
Big SQL converts between DECIMAL and other numeric types. For example, a DECIMAL with a zero scale is converted to the smallest appropriate integer type.
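For example (with a hypothetical table), a column that must hold monetary values up to ±99,999,999.99 could be declared with precision 10 and scale 2:

```sql
CREATE HADOOP TABLE sales (
  id     INT,
  amount DECIMAL(10,2)  -- 10 total digits, 2 of them fractional
);
```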
FLOAT and REAL
Big SQL defines the REAL data type as a 32-bit floating point value and the FLOAT data type as a synonym for DOUBLE (a 64-bit floating point value).
In Hive, FLOAT always refers to a 32-bit floating point value, which corresponds to the Java FLOAT data type. Therefore, in Big SQL, the REAL data type is the equivalent of the Hive FLOAT data type.
- For CREATE HADOOP TABLE statements, change all columns defined as FLOAT to REAL to retain the same storage characteristics.
- When you access Hive or Big SQL tables, FLOAT values are treated as DOUBLE values when read, which can produce minor rounding differences.
You can see precision errors when you do operations with the DOUBLE data type column and DECIMAL literals in a table defined with ROW FORMAT DELIMITED. As a best practice, use values in an E-notation format for the double data type column (+1.14000000000000E+000). You can use the ROUND function to work around some precision errors.
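The recommended DDL change can be sketched as follows (the table is hypothetical):

```sql
-- A Big SQL 1.0 definition such as "reading FLOAT" stored a 32-bit value.
-- To keep 32-bit storage in current Big SQL, declare the column as REAL:
CREATE HADOOP TABLE measurements (
  id      INT,
  reading REAL   -- FLOAT here would now be a synonym for DOUBLE (64-bit)
);
```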
ROW
The ROW type contains field definitions, each with a field name and a data type. You can use the STRUCT syntax that Hive uses to define a ROW data type, but the SQL standard syntax is recommended.
STRING
The STRING data type maps internally to VARCHAR(32672), which has the following implications:
- If a value in the underlying storage exceeds the maximum VARCHAR length, that value is silently truncated when it is converted to the Big SQL native data type.
- If the underlying data does not require the maximum VARCHAR length (for example, if the column never exceeds 100 characters), then Big SQL allocates unnecessary resources to handle that column. Use an explicit VARCHAR instead of a STRING to improve performance.
You can change the size of the VARCHAR that is used to represent the STRING type by setting the bigsql.string.size property:
SET HADOOP PROPERTY 'bigsql.string.size'=4096;
The result of this property change is that the STRING type is represented with the SQL data type VARCHAR(4096).
TIMESTAMP
The TIMESTAMP data type is a six- or seven-part value (year, month, day, hour, minute, second and, optionally, fractional seconds). The number of digits in the fractional seconds part is specified by using an attribute whose value is between 0 and 9 (default 9). Timestamps are considered to be time zone-less. For convenience, functions are provided for converting values to and from specific time zones (TO_UTC_TIMESTAMP, FROM_UTC_TIMESTAMP).
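As an illustration (the table is hypothetical), the fractional-seconds attribute and one of the conversion functions might be used as follows:

```sql
CREATE HADOOP TABLE log_events (
  id INT,
  ts TIMESTAMP(3)   -- 3 digits of fractional seconds; the default is 9
);

-- Interpret the stored (time zone-less) value as US Eastern time and convert it to UTC:
SELECT TO_UTC_TIMESTAMP(ts, 'America/New_York') FROM log_events;
```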
TINYINT
The TINYINT type in the context of the CREATE TABLE (HADOOP) statement is supported only to maintain compatibility with Hive. Values are constrained (-128 to 127) to match the Hive definition of a TINYINT.
Data types and performance
The Big SQL engine works most efficiently when the table definition enables a row to fit within 32 KB of memory. When the calculated row size exceeds 32 KB, some queries might see a performance degradation. The STRING data type should be avoided because it maps internally to VARCHAR(32672), which means that the table rows will usually exceed 32 KB in size. You can avoid this performance degradation by changing any STRING data type references to explicit VARCHAR(n) types that best match the data size.
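For example (the column sizes are hypothetical), replacing a STRING column with a right-sized VARCHAR keeps the calculated row width well under 32 KB:

```sql
-- Instead of:  CREATE HADOOP TABLE t1 (id INT, descr STRING);
-- which maps descr to VARCHAR(32672), declare the size the data actually needs:
CREATE HADOOP TABLE t1 (
  id    INT,
  descr VARCHAR(200)
);
```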