CREATE TABLE (HBASE) statement

Use the CREATE HBASE TABLE statement to create HBase tables for the Hadoop environment.

Invocation

This statement is an executable statement that cannot be dynamically prepared.

Syntax

CREATE EXTERNAL HBASE TABLE IF NOT EXISTS table-name element-list (1)
    COMMENT string-constant
    TBLPROPERTIES ( string-constant = string-constant, ... )
    SECURITY POLICY policy-name
    COLUMN MAPPING ( hbase-column-mapping )
    ADD SALT hbase-salt-options
    hbase-split-options
    DEFAULT COLUMN FAMILY OPTIONS ( hbase-cfamily-opts )
    COLUMN FAMILY OPTIONS ( hbase-cfamily hbase-cfamily-opts, ... )
    HBASE TABLE NAME table-name
    DEFAULT ENCODING hbase-encoding
    as-result-table

element-list
    ( column-definition, ... )

column-definition
    column-name data-type column-options

data-type
    primitive-type | array-type | row-type

primitive-type (2)
    TINYINT | SMALLINT | INT2
    INTEGER | INT | INT4
    BIGINT | INT8
    FLOAT | FLOAT8 | REAL | FLOAT4 | DOUBLE PRECISION
    CHARACTER(integer) | CHAR(integer) (3)
    STRING (4)
    VARCHAR(integer) | CHARACTER VARYING(integer) | CHAR VARYING(integer)
    TIMESTAMP(integer)
    DATE
    DECIMAL(integer, integer) | DEC | NUMERIC | NUM    (the default precision and scale are (10,0))
    BINARY(integer) | VARBINARY(integer)

array-type
    primitive-type | row-type ARRAY [ integer-constant ]
    primitive-type | row-type ARRAY [ INT | INTEGER | VARCHAR(int) ]

row-type
    ROW ( field-name data-type, ... )

map-type
    data-type ARRAY [ INTEGER | INT | VARCHAR(integer) ]
    MAP < INTEGER | INT | VARCHAR(integer), data-type >

column-options
    COMMENT column-comment
    NULL | NOT NULL
    CONSTRAINT constr-name constraint-choices constraint-attributes

constraint-choices
    PRIMARY KEY | UNIQUE | references-clause | CHECK ( check-condition )

references-clause
    REFERENCES table-name ( column-name, ... )

constraint-attributes
    NOT ENFORCED [ TRUSTED | NOT TRUSTED ] | ENFORCED
    ENABLE QUERY OPTIMIZATION | DISABLE QUERY OPTIMIZATION

hbase-column-mapping
    KEY MAPPED BY ( column-name, ... ) hbase-map-parameters
    hbase-column MAPPED BY ( column-name, ... ) hbase-encoding, ...

hbase-column
    hbase-column-family : hbase-column-name

hbase-salt-options
    BUCKETS = salt-buckets   CLASS = salt-class   PROPERTIES = ( name = value, ... )

hbase-split-options
    SPLITS ( split-value, ... )
    SPLITALGO [ split-algorithm, num-region-servers ] (5)
    NUMREGION ( range-start, range-end, number-of-regions ) (5)
    SPLITS_FILE filepath

hbase-cfamily-opts
    COMPRESSION | BLOOM FILTER | IN MEMORY | NO IN MEMORY | BLOCKCACHE | BLOCKSIZE |
    DATA_BLOCK_ENCODING | KEEP_DELETED_CELLS | VERSIONS | MIN_VERSIONS |
    REPLICATION_SCOPE | TTL ( hbase-cfamily-option-value )

hbase-encoding
    ENCODING { BINARY | delimited-row-format | USING serde-row-format | STRING }

hbase-map-parameters
    FORCE KEY UNIQUE hbase-encoding

delimited-row-format
    DELIMITED
        FIELDS TERMINATED BY terminator-char ESCAPED BY esc-char
        COLLECTION ITEMS TERMINATED BY terminator-char
        LINES TERMINATED BY terminator-char
        NULL DEFINED AS null-value

serde-row-format
    SERDE serde-class WITH SERDEPROPERTIES ( string-constant = string-constant, ... )

as-result-table
    ( column-name, ... ) AS ( full-select ) WITH NO DATA
Notes:
  • 1 Each clause can only be used once.
  • 2 The Big SQL database manager natively supports the data types that are described in Data types. However, there are some data types that are syntactically supported for compatibility with Hive, but are physically treated as different data types within the Big SQL runtime. For more information see the mapping of data types that are used in the CREATE TABLE statement.
  • 3 If you enable COMPATIBILITY_MODE, CHAR acts as a synonym for VARCHAR when you create the table. Then, the system catalogs show the VARCHAR data type.
  • 4 Because the STRING data type has no specific size associated with it, the SQL processor might assume that it needs to manipulate 32K of data in a column all the time. Be aware that performance can be impacted. If performance is a concern, use types that contain a specific length. You can also set the bigsql.string.size property to default the STRING mapping to a smaller VARCHAR size. Set this property in one of two ways: issue the SET HADOOP PROPERTY command or set the value globally by updating the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file.
  • 5 Do not use the SPLITALGO or NUMREGION clauses with ADD SALT. Also, SPLITALGO and NUMREGION are mutually exclusive.

Description

EXTERNAL

The EXTERNAL parameter indicates that the table is already defined in HBase. In this case, Big SQL refers to that existing table. If the EXTERNAL parameter is not included on the HBase table definition, the HBase table is automatically created by Big SQL. When Big SQL refers to an external table, it does not validate that the structure of the external table matches the table definition in Big SQL until the time a query is run that uses the table.

HBASE
You must include the HBase keyword to define an HBase table. You can verify that you have defined an HBASE table by querying the PROPERTY column in SYSCAT.TABLES.
SELECT SUBSTRING(PROPERTY,24,1) FROM SYSCAT.TABLES WHERE TABNAME='<table-name>';
table-name
Specifies the name for the table that is to be created.

If the table name exists, this statement fails. However, if the table name exists and IF NOT EXISTS is specified, no error message is issued.

The table name in the table definition must be a valid identifier.

column-definition
The name, type, and comments of the column.
column-name
You must specify the list of columns that are to be created as non-partitioned columns in the table. Each column is specified as a column name and data type. All valid Big SQL data types are allowed.

The column name in the table definition must be a valid identifier.

data-type
Specifies the data type of the column.
primitive-type
The built-in data type. Use one of the following types:
TINYINT
A 1-byte integer data type, with the range of -128 to 127.

Hadoop automatically converts the TINYINT to a larger integer type, such as SMALLINT, INT, or BIGINT, or a floating-point type such as FLOAT or DOUBLE.

SMALLINT
A small integer is a 2-byte integer with a precision of 5 digits. The range of small integers is -32,768 to 32,767
INTEGER or INT
A 4-byte integer with a precision of 10 digits. The range of large integers is -2,147,483,648 to +2,147,483,647.
BIGINT
An 8-byte integer data type with a precision of 19 digits. The range is -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807.
FLOAT(integer)
A single-precision floating-point number, which is a 32-bit approximation of a real number. The number can be zero or can range from -3.4028234663852886e+38 to -1.1754943508222875e-38, or from 1.1754943508222875e-38 to 3.4028234663852886e+38. A double-precision floating-point number is a 64-bit approximation of a real number. The number can be zero or can range from -1.7976931348623158e+308 to -2.2250738585072014e-308, or from 2.2250738585072014e-308 to 1.7976931348623158e+308.

The REAL data type is a single precision floating-point number. The DOUBLE data type is a double-precision floating-point number.

DECIMAL (precision-integer, scale-integer) or DEC(precision-integer, scale-integer)
A decimal number. The first integer is the precision of the number; that is, the total number of digits; it can range from 1 to 31. The second integer is the scale of the number; that is, the number of digits to the right of the decimal point; it can range from 0 to the precision of the number.

The default value is precision 10 and scale 0. The words NUMERIC and NUM can be used as synonyms for DECIMAL and DEC.

CHARACTER or CHAR
A character string is a sequence of characters. The length of the string is the number of characters in the sequence. If the length is zero, the value is called the empty string, which is not the NULL value.
STRING
The STRING data type has no specific size associated with it. The default is a VARCHAR 32K.
VARCHAR
A VARCHAR value can be up to 32,672 bytes long. You can also use CHARACTER VARYING or CHAR VARYING.
TIMESTAMP
A TIMESTAMP is a six- or seven-part value (year, month, day, hour, minute, second, and optional fractional seconds) that designates a date and time.

The integer must be between 0 and 9 and specifies the precision of fractional seconds from 0 (seconds) to 9 (nanoseconds).

DATE
A DATE is a three-part value (year, month, and day) that designates a date. The range of the year part is 0001 to 9999. The range of the month part is 1 to 12. The range of the day part is 1 to x, where x depends on the month.
DATE STORED AS TIMESTAMP
Indicates that the DATE data type is mapped and stored as a Hive TIMESTAMP data type. This is the default.
DATE STORED AS DATE
Indicates that the DATE data type is mapped and stored as a Hive DATE data type. For optimal performance when using this type, be sure to use the ORC file format, because the Java™ I/O interface is used to access the DATE data type (see Big SQL readers and writers).
BINARY or VARBINARY
A binary value is a variable length sequence of bytes. The length of the value is the number of bytes that it contains.

A varying length binary value is specified as VARBINARY(length) or BINARY(length) and can be up to 32,672 bytes long (VARBINARY is an alias for BINARY, and they are functionally identical in every way). If no length is provided, the length is assumed to be 32,672.

However, it is recommended that an explicit length be specified. The Big SQL engine works most efficiently when the table definition enables a row to fit within 32 KB of memory. When the calculated row size exceeds 32 KB, some queries might see a performance degradation.

With this data type, the Java I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Big SQL.

array-type, row-type

You can define a maximum of 32,763 ARRAY or ROW types for the Big SQL database (SQLSTATE 54035). Each instance of ARRAY or ROW consumes one of these entries; for example, a column that is an ARRAY of ROW types consumes two entries. If you exceed the limit, you must drop a HADOOP table that contains a ROW or ARRAY column to free entries.

Each ARRAY or ROW type can be used in one of two ways: the SQL standard version or the Hive version. The SQL standard version, which is shown in the syntax, is the recommended usage. When you declare the length of the data type (the number of elements) as part of the table definition, the database manager can compute the maximum size of the column and optimize access to it.

With this data type, the Java I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Big SQL.

For information about the usage restrictions, see Array values and Row values.

ARRAY

An ARRAY is an indexable list. The elements in the array must be of the same type. You can access an element in the array with the following notation: ...WHERE C1[2]=1..., assuming that column C1 is declared as an ARRAY with two or more elements (a query sketch appears at the end of this ARRAY description).

primitive-type or row-type
Specifies the data type of the field.
primitive-type
Specifies a built-in, or primitive data type. Built-in types include the data types that are described in primitive-types
ROW
Specifies the ROW data type. An ARRAY of ROW types is valid.
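
For illustration, here is a minimal query sketch that uses the element-access notation described above. The table t1 and column c1 (declared as INT ARRAY[5]) are hypothetical names:

    -- Return the rows whose second array element equals 1
    SELECT *
    FROM t1
    WHERE c1[2] = 1;
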
ROW

A ROW object contains a value for each attribute of the SQL structured type that it represents (java.sql.Struct). A ROW is equivalent to the STRUCT type in Hive tables. The ROW type contains field definitions that contain the field name and the data type.

For each field in a ROW definition, an entry is created in the SYSIBM.SYSATTRIBUTES table.

field-name
Specifies the name of a field within the ROW type. The name cannot be the same as any other field of this ROW type (SQLSTATE 42711).
data-type
Specifies the data type of the field.
primitive-type
Specifies a built-in, or primitive data type. Built-in types include the data types that are described in primitive-types
column-options
You can define column options that include comments, nullability, or constraints on the column.
COMMENT column-comment

Each column can contain a COMMENT to provide more information about the column.

NULL or NOT NULL

Each column can specify NOT NULL. The NOT NULL parameter signifies to the Big SQL query optimizer that the column will not contain NULL values. The HADOOP table is external to any local database server, so Big SQL enforces nullability at query execution time (SQL5104N).

NULL and NOT NULL options are used in the following ways:
  • The nullability of a column is used during certain query re-write optimizations.
  • The nullability of a column is enforced during the INSERT or LOAD USING statement. Do not assign a NULL value to a column defined as NOT NULL (SQLSTATE 23502).
CONSTRAINT constraint-name

You can optionally identify each constraint with a name. For all constraints, the default is NOT ENFORCED.

The following constraints can be set on the column:
PRIMARY KEY

Provides a shorthand method of defining a primary key. A primary key must be unique, and must have the NOT NULL attribute. A table cannot have more than one primary key.

UNIQUE

Provides a shorthand method of defining a unique key. This rule forbids duplicate values in one or more columns within a table. A table can have multiple unique keys. The columns that are specified in a unique constraint must be defined as NOT NULL.

REFERENCES table-name

If you do not specify column names, then the primary key of the table is the assumed set of referenced columns.

By using this parameter, you can define required relationships between tables. The purpose of this parameter is to guarantee that table relationships are maintained and data entry rules are followed. The column on which this parameter is used becomes the FOREIGN KEY to the parent table that is referenced by the REFERENCES keyword.

CHECK check-condition

This parameter sets restrictions on data added to a specific table. It is enforced during INSERT operations, but is not enforced during a LOAD HADOOP, or on data that is made available outside of the database manager, such as by using Hive operations. CHECK constraints are useful to ensure that a column that contains interest rates never contains a value of zero.

The search-condition must be true or unknown for every row of the table.
search-condition
The search-condition has the following restrictions:
  • A column reference must be to a column of the table being created.
  • The search-condition cannot contain a TYPE predicate.
  • The search-condition cannot contain any of the following (SQLSTATE 42621):
    • Subqueries
    • CAST specifications with a SCOPE clause
    • Column functions
    • Functions that are not deterministic
    • Functions defined to have an external action
    • User-defined functions defined with either MODIFIES SQL or READS SQL DATA
    • OLAP specifications
    • Global variables
    • References to generated columns other than the identity column
    • An error tolerant nested-table-expression

Check constraints with search-condition are enforced when rows in the table are inserted or updated. A check constraint defined on a table automatically applies to all subtables of that table.

functional-dependency
Defines a functional dependency between columns.

column-name DETERMINED BY column-name or (column-name,...) DETERMINED BY (column-name,...)

The parent set of columns contains the identified columns that immediately precede the DETERMINED BY clause. The child set of columns contains the identified columns that immediately follow the DETERMINED BY clause. All of the restrictions on the search-condition apply to parent set and child set columns, and only simple column references are allowed in the set of columns (SQLSTATE 42621). The same column must not be identified more than once in the functional dependency (SQLSTATE 42709). No column in the child set of columns can be a nullable column (SQLSTATE 42621).

A functional dependency is not enforced by the database manager during normal operations such as insert, update, or delete. The functional dependency might be used during query rewrite to optimize queries. Incorrect results might be returned if the integrity of a functional dependency is not maintained.

constraint-attributes
These constraints can improve query performance. The default for Hadoop tables is NOT ENFORCED.

For primary key constraints, unique constraints, and check constraints, you can specify that the constraint can be trusted. For referential integrity constraints, if the constraint is not enforced, you can further specify whether the constraint can be trusted. A not-enforced and not-trusted constraint is also known as a statistical referential integrity constraint. You can specify whether a referential integrity constraint or check constraint is to be used for query optimization.

NOT ENFORCED or ENFORCED

Specifies whether the constraint is enforced by the database manager during normal operations such as insert, update, or delete.

NOT ENFORCED
This parameter is the default for Hadoop tables. The constraint is not enforced by the database manager. A primary key constraint or unique constraint cannot be NOT ENFORCED if there is a dependent ENFORCED referential constraint.
TRUSTED

The data can be trusted to conform to the constraint. TRUSTED must be used only if the data in the table is independently known to conform to the constraint. Query results might be unpredictable if the data does not actually conform to the constraint. This is the default option.

NOT TRUSTED

The data cannot be trusted to conform to the constraint. NOT TRUSTED is intended for cases where the data conforms to the constraint for most rows, but it is not independently known that all the rows or future additions will conform to the constraint. If a constraint is NOT TRUSTED and enabled for query optimization, then it will not be used to perform optimizations that depend on the data conforming completely to the constraint. NOT TRUSTED can be specified only for referential integrity constraints (SQLSTATE 42613).

ENFORCED

The ENFORCED keyword is syntactically available for compatibility with the syntax for local tables (created with CREATE TABLE). However the presence of ENFORCED results in an error (SQLSTATE 42858) that indicates the operation is not supported on Hadoop tables.

ENABLE QUERY OPTIMIZATION or DISABLE QUERY OPTIMIZATION
Specifies whether the constraint or functional dependency can be used for query optimization under appropriate circumstances. The default is ENABLE QUERY OPTIMIZATION.
ENABLE QUERY OPTIMIZATION

The constraint is assumed to be true and can be used for query optimization.

DISABLE QUERY OPTIMIZATION

The constraint cannot be used for query optimization. DISABLE QUERY OPTIMIZATION cannot be specified for primary key and unique constraints (SQLSTATE 42613).

unique-constraint

Provides a shorthand method of defining a unique key. This rule forbids duplicate values in one or more columns within a table. A table can have multiple unique keys. The columns that are specified in a unique constraint must be defined as NOT NULL

referential-constraint
Defines a referential constraint.
CONSTRAINT constraint-name
Names the referential constraint.
FOREIGN KEY (column-name, ...)

Defines a referential constraint with the specified constraint-name.

The foreign key of the referential constraint is composed of the identified columns. Each name in the list of column names must identify a column of the object table and the same column must not be identified more than once. There must be the same number of foreign key columns as there are in the parent key and the data types of the corresponding columns must be compatible (SQLSTATE 42830).

check-constraint
Defines a check constraint. A check-constraint is a search-condition that must evaluate to not false or a functional dependency that is defined between columns.
constr-name

Names the check constraint.

check-condition

Defines a check constraint. The search-condition must be true or unknown for every row of the table.
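
The following sketch pulls together the constraint clauses that are described above. It is illustrative only; the table, column, and constraint names are hypothetical, and the constraints are informational (NOT ENFORCED), as is required for HBase tables:

    CREATE HBASE TABLE sales (
      order_id BIGINT NOT NULL
        CONSTRAINT pk_sales PRIMARY KEY NOT ENFORCED,
      cust_id  INT NOT NULL,
      amount   DECIMAL(10,2)
        CONSTRAINT ck_amount CHECK (amount > 0) NOT ENFORCED ENABLE QUERY OPTIMIZATION
    )
    COLUMN MAPPING (
      KEY MAPPED BY (order_id),
      cf1:cust MAPPED BY (cust_id),
      cf1:amt  MAPPED BY (amount)
    );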

COMMENT string-constant

A table comment as part of a table definition can be used to provide information about the table.

TBLPROPERTIES string-constant=string-constant

Table properties define job properties that can configure input or output formats. They can be used to specify more detailed information about the table.
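
A minimal sketch that combines the COMMENT and TBLPROPERTIES clauses; the table name and the property key and value are hypothetical placeholders:

    CREATE HBASE TABLE props_demo (
      k INT NOT NULL,
      v VARCHAR(20)
    )
    COMMENT 'demonstrates a table comment and table properties'
    TBLPROPERTIES ('my.custom.property' = 'value1')
    COLUMN MAPPING (
      KEY MAPPED BY (k),
      cf1:v MAPPED BY (v)
    );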

SECURITY POLICY policy-name
Names the security policy to be associated with the table. The policy-name identifies a security policy that already exists.

A table is not protected unless it has a security policy associated with it and it includes a column that is defined with a security label. Using a security label indicates that the table is a protected table with row level granularity.

COLUMN MAPPING hbase-column-mapping

This clause is required on all HBase tables. It defines the mapping between the columns that are defined in the element-list and the stored location in the HBase table. It also describes the encoding of values in the HBase column.

KEY MAPPED BY

Defines which columns are stored in the HBase row key. When a single column is specified for the row key, the column must be defined as NOT NULL, or an error is raised (SQL?????). When multiple columns are specified for the row key, they can be a mix of NULL and NOT NULL columns. However, a row key cannot exist in which all columns are null, or an error is raised (SQL?????).

Do not use an ARRAY or ROW data type as the row key.

column-name

The name of a column as defined in the element-list.

hbase-column MAPPED BY

Specifies the name of an HBase column into which a column-name is stored.

hbase-column-family

The column family in which the target HBase column lives. Specify a valid identifier. The column family and column qualifier mappings are case sensitive. Therefore, cf:cq is not the same as cf:CQ.

hbase-column-name

The column name within the hbase-column-family in which the column values are to be stored. You can specify either a string literal or an identifier.

FORCE KEY UNIQUE

When you specify this parameter, a unique value is automatically generated and appended into the encoded HBase row key. This parameter allows duplicate rows to be stored within the HBase table. The generated unique value is not visible from SQL. However, the value does consume an additional 128 bytes of storage for each row in the HBase table. This parameter applies to the KEY MAPPED BY keyword only.
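
A minimal sketch of FORCE KEY UNIQUE, using hypothetical table and column names; duplicate user_id values can be stored because a generated unique value is appended to each encoded row key:

    CREATE HBASE TABLE clicks (
      user_id  VARCHAR(20) NOT NULL,
      click_ts TIMESTAMP,
      url      VARCHAR(200)
    )
    COLUMN MAPPING (
      KEY MAPPED BY (user_id) FORCE KEY UNIQUE,
      cf1:ts  MAPPED BY (click_ts),
      cf1:url MAPPED BY (url)
    );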

ENCODING

Indicates how values are encoded into the HBase column.

delimited-row-format
Data is encoded as text delimited data by using the same encoding logic that is used for ROW FORMAT DELIMITED text file storage. After the data is encoded into a text value, it is converted to UTF-8 bytes during final storage.
Warning:
  • Delimited text encoding is not recommended for use in KEY mapping unless the data that is being stored is already in text format. Because the string representation of numeric values does not collate (for example “1” < “19” < “2”), Big SQL will not push down predicates on numeric data types encoded by using delimited encoding.

  • The default null definition is \N. If the separator you choose is escaped by using the same byte that is used to escape \N, the result will not be correct. To avoid this error, choose a separator that does not clash, or use the NULL DEFINED AS clause to define a different null representation. For example, if \b is chosen as the separator, override the default null representation.
    cf1:cq1 MAPPED BY (c4,c5) SEPARATOR '\b' ENCODING string NULL DEFINED AS '
    '
BINARY
Data is encoded in a binary format. The default is BINARY. BINARY is the recommended storage format, because of the following benefits:
  • It performs better when encoding and decoding data.
  • It ensures proper collation of values when they are used in the row key.
  • It generally requires less storage than STRING encoding.
USING serde-row-format
Data is encoded using a Hive SerDe class to encode and decode the columns that are contained in the column mapping.
Warning: Columns that are encoded with a SerDe have the following restrictions:
  • You cannot use these columns within a secondary index that is created by the CREATE INDEX statement.
  • The search criteria on these columns cannot be pushed into HBase for processing. This means that additional rows might be returned to Big SQL to be filtered there, rather than having the region server do the row filtering. Specifically, this means that a custom encoded HBase row key results in a full HBase table scan.
STRING
Using STRING encoding is equivalent to using DELIMITED encoding with no other options.
ADD SALT

By using this parameter, Big SQL adds a prefix to the row key. The prefix is based upon a hash function on the values that are contained within the row key. The ADD SALT parameter provides a way to attempt to scatter data that would otherwise be clustered and form “hot spots” of activity in the HBase region servers.

BUCKETS=salt-buckets

Specifies the number of buckets in which to put the prefixed row keys. By default, Big SQL uses the number of region servers that exist at the time of the CREATE TABLE as the number of buckets. The maximum number of buckets that are allowed is 32.

CLASS='class-name'

Specifies the name of the class that does the salting. By default, the ADD SALT option uses the Big SQL algorithm, com.ibm.biginsights.bigsql.hbasecommon.salt.BigSqlSalt.

You can use any class that extends the Big SQL salting algorithm, or you can define your own salting algorithm. Ensure that the class is available to Big SQL and that it extends com.ibm.biginsights.bigsql.hbasecommon.salt.BigSqlSalt. Use the following steps to make the class available to Big SQL:
  1. As the bigsql user type the following command: $BIGSQL_HOME/bin/bigsql stop.
  2. Add the JAR file that contains the salting class to the $BIGSQL_HOME/bigsql/userlib folder.
  3. Type the following command: $BIGSQL_HOME/bin/bigsql start.
PROPERTIES
Specifies the properties that can be used to configure a custom salting class, in the form of quoted name and value pairs. For example,

properties = ('padding'=2)
hbase-split-options

When you specify these options, the HBase table is created as pre-split regions across region servers.

SPLITS (split-point1, split-point2, ..., split-pointN)

Provides the values of the first column that is stored in the HBase row key at which each split point begins. The total number of regions that are created is the number of split keys plus one.

You can specify the values as string or numeric representations of a value of the data type of the first column that is contained in the HBase row key. You can see examples of splitting in the Examples section.

SPLITS_FILE file-path

Specifies a file containing split points. The file path must be a fully qualified distributed file system (DFS) path to a file that contains split point values. In the file, there must be one value for each line. The contents of this file are treated as if those values were provided to the SPLITS clause.
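
A sketch of the SPLITS_FILE clause, assuming a hypothetical DFS file /user/bigsql/splits.txt that contains one split value per line (for example 1000, 2000, and 3000); quoting the path as a string literal is an assumption here:

    CREATE HBASE TABLE split_from_file (
      c1 INT NOT NULL,
      c2 VARCHAR(20)
    )
    COLUMN MAPPING (
      KEY MAPPED BY (c1),
      cf1:c2 MAPPED BY (c2)
    )
    SPLITS_FILE '/user/bigsql/splits.txt';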

SPLITALGO [split-algorithm,num-region-servers]

Use this parameter to include a split algorithm that evenly divides the space of possible keys into the specified number of regions. This parameter is useful when the keys are relatively uniform. SPLITALGO and NUMREGION are mutually exclusive; do not use them in the same statement. Do not use this parameter with the ADD SALT keyword.

You can select one of two algorithms that are included when you install Big SQL. Choose the algorithm that works best with your data:

'HexStringSplit'

The format of a 'HexStringSplit' region boundary is any uniformly distributed hexadecimal value. This split algorithm uses hex strings as keys, which is easy to read. However, the results can take up more space.

'UniformSplit'

By using this class, you divide the space of possible keys evenly. This algorithm is useful when the keys are nearly uniform random bytes. The results use less space, but they might not be as readable.
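
A sketch of SPLITALGO with the 'HexStringSplit' algorithm, following the bracketed form shown in the syntax summary; the table name and the region count of 10 are illustrative:

    CREATE HBASE TABLE presplit_hex (
      rowkey VARCHAR(32) NOT NULL,
      c0 INT
    )
    COLUMN MAPPING (
      KEY MAPPED BY (rowkey),
      f:q MAPPED BY (c0)
    )
    SPLITALGO ['HexStringSplit', 10];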

NUMREGION (range-start,range-end,number-of-regions)

Use this parameter to include a split algorithm that evenly divides the space of possible keys into the specified number of regions. This parameter is useful when the keys are relatively uniform. NUMREGION and SPLITALGO are mutually exclusive; do not use them in the same statement. Do not use this parameter with the ADD SALT keyword.

You can specify the range-start and range-end values as a number or a string literal value of the first column in the HBase row key. The examples in the Examples section show the use of NUMREGION.

DEFAULT COLUMN FAMILY OPTIONS

This parameter specifies the HBase configuration options that are applied, by default, to all column families that are defined for the HBase table. You can define or override these options for individual column families by using the COLUMN FAMILY OPTIONS parameter.

COLUMN FAMILY OPTIONS (OPTION NAME (option-value) ...)

This parameter specifies the HBase configuration options that are applied to a specific column family that is used by the HBase table. You specify the options in the form of an option name and option value pair. You can omit the option value for boolean options, which have a true or false value, to indicate a true value. The following options are available in the COLUMN FAMILY OPTIONS parameter:

COMPRESSION

Use this option to compress files to reduce the amount of storage that is used. The default is NONE, which means that no compression is applied to any column family, even when the COLUMN FAMILY OPTIONS parameter is not specified.

The following compression algorithms are valid. The compression ratio and performance during encoding and decoding of each algorithm depends on your environment and use case:

NONE

Indicates that no compression is used when storing the column family. This value is the default.

GZ
Select this option to use GZIP compression.
SNAPPY
Select this option to use SNAPPY compression.
LZ4
Select this option to use LZ4 compression.
BLOOM FILTER

Bloom filters help identify whether a row and column combination exists in a block without loading it. Using bloom filters uses extra space. However, using bloom filters improves look-up times and helps reduce block cache churn by loading only blocks that are required.

NONE
Indicates that no bloom filter is maintained. This is the default value, even when you do not specify a COLUMN FAMILY OPTIONS parameter.
ROW
Indicates that the hash of the row is added to the bloom on each insert. Use this option when queries have predicates on row keys.
ROWCOL
Indicates that the hash of the row plus column family plus column qualifier is added to the bloom on each key insert. Use this option when queries have column projections.
IN MEMORY

The possible values are false (the default) or true. A value of true indicates that data is to be kept in cache for the longest period possible to improve performance.

NO IN MEMORY

The NO IN MEMORY option is retained for backwards compatibility with prior Big SQL releases. Use the IN MEMORY option instead.

BLOCKCACHE

Blocks are the unit of data that HBase reads from the storage files and caches in memory. The possible values for BLOCKCACHE are true or false. When you enable BLOCKCACHE, read performance can improve. Enable BLOCKCACHE for frequently accessed rows.

BLOCKSIZE

You specify this value in bytes. It is used to control how much data HBase is required to read from the storage files during retrieval and what is cached in memory for subsequent accesses.

DATA_BLOCK_ENCODING

This option defines in-cache key compression. By using this option, you can take advantage of the redundancy between sorted keys in an HFile block. You can choose to store only the differences between consecutive keys.

KEEP_DELETED_CELLS

You can enable this option so that deleted cells can be retrieved with a GET or SCAN operation, as long as these operations specify a time range that ends before the timestamp of any delete that would affect the cells. This allows for point-in-time queries even in the presence of deletes. The possible values for KEEP_DELETED_CELLS are true or false.

VERSIONS

This option is an integer value that represents the number of row versions to retain.

MIN_VERSIONS

The minimum number of row versions to keep for each column family. The default value is 0, which disables the option. You can use this parameter with the time to live (TTL) parameter to keep the last TTL seconds worth of data, and a minimum of MIN_VERSIONS.

REPLICATION_SCOPE

To minimize the effects of system failures, enable replication on the column family. The default value is 0, which means that the column family is not replicated. A value of 1 means that the column family is replicated.

TTL

This option defines a time to live (TTL) in seconds for the column family. HBase automatically deletes all versions of the row (including the most current) that have exceeded the specified time to live.
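
A sketch that combines DEFAULT COLUMN FAMILY OPTIONS with a per-family override, modeled on Example 3; the option values shown (SNAPPY compression, 3 versions, a one-day TTL) are illustrative, not recommendations:

    CREATE HBASE TABLE cf_defaults (
      akey VARCHAR(10) NOT NULL,
      c1   VARCHAR(70),
      c2   VARCHAR(30)
    )
    COLUMN MAPPING (
      KEY MAPPED BY (akey),
      cf1:c1 MAPPED BY (c1),
      cf2:c2 MAPPED BY (c2)
    )
    DEFAULT COLUMN FAMILY OPTIONS (
      COMPRESSION(SNAPPY) VERSIONS(3) TTL(86400)
    )
    COLUMN FAMILY OPTIONS (
      cf1 BLOOM FILTER(ROW) IN MEMORY
    );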

HBASE TABLE NAME

Specifies the name of the underlying HBase table. By using this clause, you can give a different name to a table in Big SQL. HBase table names are case sensitive. For example, if you specify HBASE TABLE NAME hbasetable1, but the actual HBase table name is HBaseTable1, the statement returns an error indicating that the HBase table does not exist.

If your HBase table names are similar, such as hbasetable1 and HBaseTable1 except for the case, then you can use this clause to differentiate among them in Big SQL. For example, if your HBase table is named tab1, you can refer to that table as tab1. A second HBase table, named TAB1, can be referred to as tab1_2 in Big SQL. If tab1 and TAB1 already exist in HBase, then they can be referenced with tab1 and tab1_2 in Big SQL by using the EXTERNAL clause.

If the HBASE table name is not specified, the name of the HBase table is schema.tablename where the schema is the schema in which the table is defined in Big SQL and the table name is the name of the table as defined in Big SQL. Both names will be transformed as needed, to conform to Hive identifier rules. For more information about Identifiers in Hive, see Identifiers.

If you create an external HBase table on top of an existing HBase table by using the HBASE TABLE NAME clause, make sure that you follow these naming rules:

Characters for user names and group names must be lowercase a through z, 0 through 9, and _ (underscore); names must not start with 0 through 9.
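
A sketch of referring to an existing HBase table whose name differs only by case, assuming a hypothetical HBase table named TAB1; quoting the name to preserve its case is an assumption here:

    CREATE EXTERNAL HBASE TABLE tab1_2 (
      k VARCHAR(10) NOT NULL,
      v VARCHAR(40)
    )
    COLUMN MAPPING (
      KEY MAPPED BY (k),
      cf1:v MAPPED BY (v)
    )
    HBASE TABLE NAME 'TAB1';
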
DEFAULT ENCODING

Specifies the default data encoding that is used by any column mapping that does not explicitly specify one (see the information on ENCODING). If you do not specify an encoding, the default encoding is BINARY, which is the optimal value for performance and correctness.

(column-name) AS (full-select) WITH NO DATA

Creates a table that is based on a SELECT statement. The table name cannot refer to an HBase table that contains an ARRAY or ROW column (SQLSTATE 428H2).

You can optionally rename the columns in the target table.

By using the WITH NO DATA clause, you can define the target table without populating the table.
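
A sketch of creating an HBase table from a query, assuming a hypothetical source table sales with columns order_id and amount; the placement of the AS clause after COLUMN MAPPING follows the syntax summary and Example 12:

    CREATE HBASE TABLE sales_copy
    COLUMN MAPPING (
      KEY MAPPED BY (order_id),
      cf1:amt MAPPED BY (amount)
    )
    AS (SELECT order_id, amount FROM sales) WITH NO DATA;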

Usage notes

  • When you use data types such as STRING and BINARY, you can cause the SQL processor to assume that it needs to manipulate 32K of data in a column all the time. This result can impact performance. Instead, use data types that contain a specific length. You can also set the bigsql.string.size property to default the STRING mapping to a smaller VARCHAR size. Set this property in one of two ways:
    • Run the SET HADOOP PROPERTY command (a sketch follows this list).
    • Set the value globally by updating the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file.
  • If no precision is included on the TIMESTAMP definition, the default is TIMESTAMP(9), which is compatible with a maximum precision of a TIMESTAMP in Hive.
  • If you create a table in a schema that does not exist, the schema is automatically created with the default name. The default name is your User ID.
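
A sketch of the SET HADOOP PROPERTY approach from the first usage note; the value 4096 is an illustrative choice, and the exact quoting of the value may vary:

    -- Default the STRING mapping to VARCHAR(4096) instead of 32K
    SET HADOOP PROPERTY 'bigsql.string.size' = 4096;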

Restrictions

Restriction: If only the /apps/hbase/data directory is encrypted, INSERT ... SELECT statements and bulk LOAD operations into HBase tables do not work.
The following restrictions exist for the complex data types:
  1. Do not use an ARRAY or ROW or STRUCT data type as the row key.
  2. Do not use a ROW or STRUCT or ARRAY data type in a constraint or key definition (SQLSTATE 42962).
  3. Do not reference a table that contains a column of type ARRAY or ROW or STRUCT in the AS or LIKE clause (SQLSTATE 428H2).
  4. There is a limit of 32,763 ARRAY, ROW, or STRUCT definitions per database (SQLSTATE 54035). This limit also includes any user-defined types that are created with the CREATE TYPE statement.
  5. Comparisons or assignments between ARRAY, ROW, or STRUCT variables, and the use of ARRAY or ROW data types as parameters to a stored procedure, are not supported with ARRAY columns in a Hadoop table.
  6. Do not use an ARRAY data type as a type within an ARRAY type for Hadoop tables.
  7. The result type of a column in a full-select cannot be of type ARRAY.

The use of scrollable cursors is not supported for HBase tables.

Examples

  1. Create an HBase table with composite keys and dense columns:

    In this example, the HBase table mixed_encodings has six SQL columns, c1 to c6. The HBase table is created with one column family (cf1) with two columns (cq1 and cq2). The HBase row key is mapped to three SQL columns (a composite key). It uses binary encoding because the DEFAULT ENCODING clause specifies BINARY. The HBase column cf1:cq1 is mapped to two SQL columns (a dense column). It is encoded as delimited text and uses '\b' as the field separator character. The HBase column cf1:cq2 is mapped to a single SQL column, c6, and is encoded as binary.

    
    CREATE HBASE TABLE mixed_encodings (
      c1 int,c2 int,c3 int,c4 varchar(20),c5 varchar(40),c6 varchar(90)
     )
    COLUMN MAPPING (
      KEY MAPPED BY (c1,c2,c3),
      cf1:cq1 MAPPED BY (c4,c5)
    ENCODING DELIMITED 
    FIELDS TERMINATED BY '\b',
    cf1:cq2 MAPPED BY (c6) ENCODING BINARY
    )
    DEFAULT ENCODING BINARY
    ;
  2. Create an HBase table where all columns are in the primary key:
    In this example, you can see that an HBase table can be created where all of the columns for a table are contained in the HBase primary key. There are no HBase columns specified in this table creation.
    CREATE HBASE TABLE allkey (
    k1 VARCHAR(10), k2 VARCHAR(70),k3 VARCHAR(30)
    )
    COLUMN MAPPING (
      KEY MAPPED BY (k1, k2, k3)
    );
  3. Create an HBase table with column options:
    CREATE HBASE TABLE colopt1 (
     akey VARCHAR(10),c1  VARCHAR(70),c2  VARCHAR(30)
     )
    COLUMN MAPPING (
      KEY MAPPED BY (akey),
      cf1:c1 MAPPED BY (c1),
      cf2:c2 MAPPED BY (c2)
    )
    COLUMN FAMILY OPTIONS (
      cf1 COMPRESSION(GZ) BLOOM FILTER(ROW) IN MEMORY, cf2 NO IN MEMORY
    );
  4. Create an HBase table using SerDe encoding:
    CREATE EXTERNAL HBASE TABLE tableName (
      c1 BIGINT, c2 INT, c3 INT
     )
    COLUMN MAPPING ( 
     KEY MAPPED BY (c1),
     cf1:cq1 MAPPED BY (c2,c3)
     ENCODING USING SERDE 'com.example.SerDeClass'
     WITH SERDEPROPERTIES
      ('key1'='value1','key2'='value2')
    );
  5. Create an HBase table by using salting techniques:
    This example shows how to use the salting techniques so that you can spread out the hot spots in the range. New records are split into multiple buckets, and each record goes to a different region in the cluster.
    CREATE HBASE TABLE HBTABLE (
     c1 BIGINT,c2 INT,c3 INT
     )
    COLUMN MAPPING (
      KEY MAPPED BY (c1),
      cf1:cq1 MAPPED BY (c2,c3)
     )
    ADD SALT (
     BUCKETS = 3
     CLASS = 'com.sample.SimpleSalt'
     PROPERTIES = ('padding' = '2')
    );

    This example implicitly creates four splits, which consist of the three buckets plus one.

  6. Create an HBase table with default salt options:
    This example defines a table with salt option with no parameters. Big SQL uses the default salting algorithm, which defines the number of buckets as the number of region servers that exist at the time the table is created.
    CREATE HBASE TABLE mysalt1 (
     rowkey INT,c0 INT
    ) 
    COLUMN MAPPING (
      KEY MAPPED BY (rowkey),
      f:q MAPPED BY (c0)
    )
    ADD SALT;
  7. Create an HBase table with a combination of salting and splitting:

    This example uses a combination of salting and explicit splitting to produce seven splits (the number of explicit splits plus one). You must know your data to use the split option. Big SQL does not validate your split information.

    CREATE HBASE TABLE salt_and_explicit_split (
      rowkey VARCHAR(40),c0 INT
    ) 
    COLUMN MAPPING (
      KEY MAPPED BY (rowkey), 
      f:q MAPPED BY (c0)
    ) 
    ADD SALT (
     buckets=3
    ) 
    SPLITS (
     '0a', '0n', '1a', '1n', '2a', '2n'
    );
  8. Create an HBase table by using a compound row key:
    This example shows a compound row key with a numeric value and BINARY default encoding:
    CREATE HBASE TABLE hbtable (
      c1 INT,c2 INT,c3 INT, 
      c4 VARCHAR(10),c5 VARCHAR(20),
      c6 VARCHAR(30)
    )
    COLUMN MAPPING (
      KEY MAPPED BY (c1,c2,c3),
      cf1:cq1 MAPPED BY (c4,c5)
      ENCODING DELIMITED 
      FIELDS TERMINATED BY '\b',
      cf1:cq2 MAPPED BY (c6) 
      ENCODING binary
     )
    DEFAULT ENCODING BINARY 
    NUMREGION(10,30,5);
    The table is created with splits on C1. The NUMREGION parameter defines the starting point of the range as 10, and the ending point of the range as 30, with 5 regions in which to split. The regions for the hbtable table are shown in the following table:
    Table 1. Example split regions
      Region   Split on C1 values
      1        Less than 10
      2        10 - 15
      3        16 - 21
      4        22 - 29
      5        30 and greater
    Be aware that if C1 in the hbtable table contains character or string values, the split might not be as efficient when the ENCODING is BINARY. Match the encoding with the value type of the row key column for more efficient splits.
    Table 2. Guidelines for splitting and using default encoding
      Row key value       Default encoding
      Numeric values      BINARY ENCODING
      Character values    STRING ENCODING
      String values       STRING ENCODING
  9. Create an HBase table by using a custom SerDe:
    The following example includes a custom SerDe. By using this SerDe, you can read in data that is stored as JSON objects and convert that data to a JSON string that can be understood by HBase. The actual content of your HBase table is from JSON documents:
    CREATE HBASE TABLE post
    (
       id		INT,
       title	STRING,
       content	STRING,
       created_on	TIMESTAMP,
       updated_on	TIMESTAMP,
       published	BOOLEAN
    )
    COLUMN MAPPING
    (
       KEY      MAPPED BY(id),
       cf1:cq1  MAPPED BY(title, content, created_on, updated_on, published)
    )
    DEFAULT ENCODING USING SERDE 'com.ibm.biginsights.bigsql.serde.JSONSerDe';
    The example insert is:
    insert into post 
      values(1, 'Hello world!', 'First post!',
         '2013-01-01 01:01:01.001', '2013-01-01 01:01:01.001', true);
  10. Splitting on string or numeric representations:
    
    CREATE HBASE TABLE T1 (
      C1 INT, C2 VARCHAR(10), C3 VARCHAR(20)
    )
     COLUMN MAPPING (
      KEY MAPPED BY (C1,C2),
      cf1:c1 MAPPED BY (C3)
      )
    SPLITS (1000, 2000, 3000)
    ;
    
    CREATE HBASE TABLE T1 (
      C1 DATE, C2 VARCHAR(10), C3 VARCHAR(20)
     )
    COLUMN MAPPING (
      KEY MAPPED BY (C1,C2),
      cf1:c1 MAPPED BY (C3)
     )
    SPLITS ('1970-01-01', '2000-01-01', '2010-01-01')
    ;
  11. Splitting on NUMREGION:
    CREATE HBASE TABLE T1 (
      C1 INT, C2 VARCHAR(10), C3 VARCHAR(20)
     )
    COLUMN MAPPING (
      KEY MAPPED BY (C1,C2),
      cf1:c1 MAPPED BY (C3)
     )
    NUMREGION (0, 10000, 20)
    ;
    
    
    CREATE HBASE TABLE T1 (
      C1 DATE, C2 VARCHAR(10), C3 VARCHAR(20)
     )
    COLUMN MAPPING (
      KEY MAPPED BY (C1,C2),
      cf1:c1 MAPPED BY (C3)
     )
    NUMREGION ('1970-01-01', '2100-01-01', 20)
    ;
  12. Creating a table as another table in HBase:
    
    CREATE HBASE TABLE hb1
    ...
     AS SELECT * FROM  hbcopy
    ;
  13. This example shows you how to create a table that uses complex data types such as STRUCT and ARRAY.
    CREATE HBASE TABLE complex_types (
      key varchar(5), c1 STRUCT<f1:varchar(5), f2:varchar(5)>, c2 ARRAY<varchar(5)>)
    COLUMN MAPPING
    (
      KEY MAPPED BY (key),
      f:q1 mapped by (c1) ENCODING DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '/' LINES TERMINATED BY '\n',
      f:q2 mapped by (c2) ENCODING DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '/' LINES TERMINATED BY '\n'
    );
    Suppose that the following sample data is stored in the HDFS, in a file named data.csv in folder /tmp:
    key1,f1v1/f2v1,a1v1/a2v1/a3v1
    key2,f1v2/f2v2,a1v2/a2v2/a3v2
    key3,f1v3/f2v3,a1v3/a2v3/a3v3
    You can use the LOAD command to populate the table with this data, as shown in the following example:
    load hadoop using file url '/tmp/data.csv'
      with source properties ('field.delimiter'=',','collection.items.delimiter'='/')
      into table complex_types;
    After the data is loaded, you can query the complex types as individual elements.
    select key, c1.f1, c1.f2, c2[1], c2[2], c2[3] from complex_types;