CREATE TABLE (HBASE) statement
Use the CREATE HBASE TABLE statement to create HBase tables for the Hadoop environment.
Invocation
This statement is an executable statement that cannot be dynamically prepared.
Syntax
- 1 Each clause can be used only once.
- 2 The Big SQL database manager natively supports the data types that are described in Data types. However, there are some data types that are syntactically supported for compatibility with Hive, but are physically treated as different data types within the Big SQL runtime. For more information see the mapping of data types that are used in the CREATE TABLE statement.
- 3 If you enable COMPATIBILITY_MODE, CHAR acts as a synonym for VARCHAR when you create the table. Then, the system catalogs show the VARCHAR data type.
- 4 Because the STRING data type has no specific size associated with it, the SQL processor might assume that it needs to manipulate 32K of data in a column all the time. Be aware that performance can be impacted. If performance is a concern, use types that contain a specific length. You can also set the bigsql.string.size property to default the STRING mapping to a smaller VARCHAR size. Set this property in one of two ways: issue the SET HADOOP PROPERTY command or set the value globally by updating the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file.
- 5 Do not use the SPLITALGO or NUMREGION clauses with ADD SALT. Also, SPLITALGO and NUMREGION are mutually exclusive.
Description
- EXTERNAL
-
The EXTERNAL parameter indicates that the table is already defined in HBase. In this case, Big SQL refers to that existing table. If the EXTERNAL parameter is not included on the HBase table definition, the HBase table is automatically created by Big SQL. When Big SQL refers to an external table, it does not validate that the structure of the external table matches the table definition in Big SQL until the time a query is run that uses the table.
- HBASE
- You must include the HBase keyword to define an HBase table. You can verify that you have
defined an HBASE table by querying the PROPERTY column in
SYSCAT.TABLES.
SELECT SUBSTRING(PROPERTY,24,1) FROM SYSCAT.TABLES WHERE TABNAME='<table-name>';
- table-name
- Specifies the name for the table that is to be created.
If the table name exists, this statement fails. However, if the table name exists and IF NOT EXISTS is specified, no error message is issued.
The table name in the table definition must be a valid identifier.
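As a hedged sketch of the IF NOT EXISTS behavior (the table and column names here are hypothetical):
CREATE HBASE TABLE IF NOT EXISTS sales_events (
  event_id BIGINT NOT NULL,
  payload VARCHAR(200)
)
COLUMN MAPPING (
  KEY MAPPED BY (event_id),
  cf1:payload MAPPED BY (payload)
);
If sales_events already exists, the statement completes without an error and leaves the existing definition unchanged; without IF NOT EXISTS, the same statement would fail.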
- column-definition
- The name, type, and comments of the column.
- column-name
- You must specify the list of columns that are to be created as non-partitioned columns in the
table. Each column is specified as a column name and data type. All valid Big SQL
data types are allowed.
The column name in the table definition must be a valid identifier.
- data-type
- Specifies the data type of the column.
- primitive-type
- The built-in data type. Use one of the following types:
- TINYINT
- A 1-byte integer data type, with the range of -128 to
127.
Hadoop automatically converts the TINYINT to a larger integer type, such as SMALLINT, INT, or BIGINT, or a floating-point type such as FLOAT or DOUBLE.
- SMALLINT
- A small integer is a 2-byte integer with a precision of 5 digits. The range of small integers is -32,768 to 32,767
- INTEGER or INT
- A 4-byte integer with a precision of 10 digits. The range of large integers is -2,147,483,648 to +2,147,483,647.
- BIGINT
- An 8-byte integer data type with a precision of 19 digits. The range is -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807.
- FLOAT(integer)
A single-precision floating-point number is a 32-bit approximation of a real number. The number can be zero or can range from -3.4028234663852886e+38 to -1.1754943508222875e-38, or from 1.1754943508222875e-38 to 3.4028234663852886e+38. A double-precision floating-point number is a 64-bit approximation of a real number. The number can be zero or can range from -1.7976931348623158e+308 to -2.2250738585072014e-308, or from 2.2250738585072014e-308 to 1.7976931348623158e+308.
The REAL data type is a single precision floating-point number. The DOUBLE data type is a double-precision floating-point number.
- DECIMAL (precision-integer, scale-integer) or DEC(precision-integer, scale-integer)
- A decimal number. The first integer is the precision of the number; that is, the total number of
digits; it can range from 1 to 31. The second integer is the scale of the number; that is, the
number of digits to the right of the decimal point; it can range from 0 to the precision of the
number.
The default value is precision 10 and scale 0. The words NUMERIC and NUM can be used as synonyms for DECIMAL and DEC.
- CHARACTER or CHAR
- A character string is a sequence of characters. The length of the string is the number of characters in the sequence. If the length is zero, the value is called the empty string, which is not the NULL value.
- STRING
- The STRING data type has no specific size associated with it. The default mapping is VARCHAR(32672).
- VARCHAR
- A VARCHAR value can be up to 32,672 bytes long. You can also use CHARACTER VARYING or CHAR VARYING.
- TIMESTAMP or TIMESTAMP(integer)
- A TIMESTAMP is a six- or seven-part value (year, month, day, hour, minute, second, and optional
fractional seconds) that designates a date and time.
The integer must be between 0 and 9 and specifies the precision of the fractional seconds, from 0 (seconds) to 9 (nanoseconds).
- DATE
- A DATE is a three-part value (year, month, and day) that designates a date.
The range of the year part is 0001 to 9999.
The range of the month part is 1 to 12. The range of the day part is 1 to x, where x depends on the month.
- DATE STORED AS TIMESTAMP
- Indicates that the DATE data type is mapped and stored as a Hive TIMESTAMP data type. This is the default.
- DATE STORED AS DATE
- Indicates that the DATE data type is mapped and stored as a Hive DATE data type. For optimal performance when using this type, be sure to use the ORC file format, because the Java™ I/O interface is used to access the DATE data type (see Big SQL readers and writers).
- BINARY or VARBINARY
- A binary value is a variable length sequence of bytes. The length of the value is the number of
bytes that it contains.
A varying length binary value is specified as VARBINARY(length) or BINARY(length) and can be up to 32,672 bytes long (VARBINARY is an alias for BINARY, and they are functionally identical in every way). If no length is provided, the length is assumed to be 32,672.
However, it is recommended that an explicit length be specified. The Big SQL engine works most efficiently when the table definition enables a row to fit within 32 KB of memory. When the calculated row size exceeds 32 KB, some queries might see a performance degradation.
With this data type, the Java I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Big SQL.
- array-type, row-type
-
You can define only 32,763 ARRAY or ROW types for the Big SQL database (SQLSTATE 54035). Each instance of ARRAY or ROW consumes one of these entries; for example, a column of an ARRAY of ROW types consumes two entries. If you exceed the limit, you must drop a HADOOP table that contains a ROW or ARRAY defined as a column type.
Each ARRAY or ROW type can be used in one of two ways: the SQL standard version or the Hive version. The SQL standard version, which is displayed in the syntax, is the recommended usage. When you include the length of the data type (the number of elements) as part of the table declaration, the database manager can compute the maximum size of the column, which enables it to optimize the column.
With this data type, the Java I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Big SQL.
For information about the usage restrictions, see Array values and Row values.
- ARRAY
-
An ARRAY is an indexable list. The elements in the array must all be of the same type. You can access the elements in the array by the following notation, assuming that column C1 is declared as an ARRAY with two or more elements:
...WHERE C1[2]=1...
- primitive-type or row-type
- Specifies the data type of the field.
- primitive-type
- Specifies a built-in, or primitive data type. Built-in types include the data types that are described in primitive-types
- ROW
- Specifies the ROW data type. An ARRAY of ROW types is valid.
- ROW
-
A ROW object contains a value for each attribute of the SQL structured type that it represents (java.sql.Struct). A ROW is equivalent to the STRUCT type in Hive tables. The ROW type contains field definitions that contain the field name and the data type.
For each field in a ROW definition, an entry is created in the SYSIBM.SYSATTRIBUTES table.
- field-name
- Specifies the name of a field within the ROW type. The name cannot be the same as any other field of this ROW type (SQLSTATE 42711).
- data-type
- Specifies the data type of the field.
- primitive-type
- Specifies a built-in, or primitive data type. Built-in types include the data types that are described in primitive-types
- column-options
- You can define column options that include comments, nullability, or constraints on the column.
- COMMENT column-comment
-
Each column can contain a COMMENT to provide more information about the column.
- NULL or NOT NULL
-
Each column can specify NOT NULL. The NOT NULL parameter signifies to the Big SQL query optimizer that the column will not contain NULL values. The HBase table is external to any local database server, so Big SQL enforces nullability at query execution time (SQL5104N).
NULL and NOT NULL options are used in the following ways:
- The nullability of a column is used during certain query rewrite optimizations.
- The nullability of a column is enforced during the INSERT or LOAD USING statement. Do not assign a NULL value to a column defined as NOT NULL (SQLSTATE 23502).
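For example, in this hypothetical definition the row-key column is declared NOT NULL, so the optimizer can rely on its nullability, and an INSERT of a NULL value into id is rejected at execution time (SQLSTATE 23502):
CREATE HBASE TABLE t_notnull (
  id INT NOT NULL,
  name VARCHAR(30)
)
COLUMN MAPPING (
  KEY MAPPED BY (id),
  cf1:name MAPPED BY (name)
);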
- CONSTRAINT constraint-name
-
You can optionally identify each constraint with a name. For all constraints, the default is NOT ENFORCED.
- constraint-attributes
- These constraints can improve query performance. The default for Hadoop tables is NOT ENFORCED.
For primary key constraints, unique constraints, and check constraints, you can specify that the constraint can be trusted. For referential integrity constraints, if the constraint is not enforced, you can further specify whether the constraint can be trusted. A not-enforced and not-trusted constraint is also known as a statistical referential integrity constraint. You can specify whether a referential integrity constraint or check constraint is to be used for query optimization.
- NOT ENFORCED or ENFORCED
-
Specifies whether the constraint is enforced by the database manager during normal operations such as insert, update, or delete.
- NOT ENFORCED
-
This parameter is the default for Hadoop tables. The constraint is not enforced by the database manager. A primary key constraint or unique constraint cannot be NOT ENFORCED if there is a dependent ENFORCED referential constraint.
- TRUSTED
-
The data can be trusted to conform to the constraint. TRUSTED must be used only if the data in the table is independently known to conform to the constraint. Query results might be unpredictable if the data does not actually conform to the constraint. This is the default option.
- NOT TRUSTED
-
The data cannot be trusted to conform to the constraint. NOT TRUSTED is intended for cases where the data conforms to the constraint for most rows, but it is not independently known that all the rows or future additions will conform to the constraint. If a constraint is NOT TRUSTED and enabled for query optimization, then it will not be used to perform optimizations that depend on the data conforming completely to the constraint. NOT TRUSTED can be specified only for referential integrity constraints (SQLSTATE 42613).
- ENFORCED
-
The ENFORCED keyword is syntactically available for compatibility with the syntax for local tables (created with CREATE TABLE). However the presence of ENFORCED results in an error (SQLSTATE 42858) that indicates the operation is not supported on Hadoop tables.
- ENABLE QUERY OPTIMIZATION or DISABLE QUERY OPTIMIZATION
-
Specifies whether the constraint or functional dependency can be used for query optimization under appropriate circumstances. The default is ENABLE QUERY OPTIMIZATION.
- ENABLE QUERY OPTIMIZATION
-
The constraint is assumed to be true and can be used for query optimization.
- DISABLE QUERY OPTIMIZATION
-
The constraint cannot be used for query optimization. DISABLE QUERY OPTIMIZATION cannot be specified for primary key and unique constraints (SQLSTATE 42613).
- unique-constraint
-
Provides a shorthand method of defining a unique key. This rule forbids duplicate values in one or more columns within a table. A table can have multiple unique keys. The columns that are specified in a unique constraint must be defined as NOT NULL.
- referential-constraint
- Defines a referential constraint.
- CONSTRAINT constraint-name
- Names the referential constraint.
- FOREIGN KEY (column-name, ...)
-
Defines a referential constraint with the specified constraint-name.
The foreign key of the referential constraint is composed of the identified columns. Each name in the list of column names must identify a column of the object table and the same column must not be identified more than once. There must be the same number of foreign key columns as there are in the parent key and the data types of the corresponding columns must be compatible (SQLSTATE 42830).
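As an illustrative sketch (table and column names are hypothetical, and a parent table customers with primary key cust_id is assumed to exist), a referential constraint on an HBase table might be declared as follows; per the defaults described above, it is NOT ENFORCED:
CREATE HBASE TABLE orders (
  order_id INT NOT NULL,
  cust_id INT,
  CONSTRAINT fk_cust FOREIGN KEY (cust_id)
    REFERENCES customers (cust_id) NOT ENFORCED ENABLE QUERY OPTIMIZATION
)
COLUMN MAPPING (
  KEY MAPPED BY (order_id),
  cf1:cust_id MAPPED BY (cust_id)
);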
- check-constraint
-
Defines a check constraint. A check-constraint is a search-condition that must evaluate to not false or a functional dependency that is defined between columns.
- constr-name
-
Names the check constraint.
- check-condition
-
Defines a check constraint. The search-condition must be true or unknown for every row of the table.
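A minimal sketch of a check constraint, using hypothetical names; the condition must evaluate to true or unknown for every row:
CREATE HBASE TABLE items (
  item_id INT NOT NULL,
  qty INT,
  CONSTRAINT chk_qty CHECK (qty >= 0) NOT ENFORCED
)
COLUMN MAPPING (
  KEY MAPPED BY (item_id),
  cf1:qty MAPPED BY (qty)
);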
- COMMENT string-constant
-
A table comment as part of a table definition can be used to provide information about the table.
- TBLPROPERTIES string-constant=string-constant
-
The TBLPROPERTIES clause defines job properties that can configure input or output formats. You can use these properties to specify more detailed information about the table.
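For example, with a hypothetical property name (TBLPROPERTIES accepts arbitrary name-value string pairs; the clause placement shown here follows the overall syntax summary and should be verified against your release):
CREATE HBASE TABLE tprops (
  k1 INT NOT NULL,
  c1 VARCHAR(10)
)
COLUMN MAPPING (
  KEY MAPPED BY (k1),
  cf1:c1 MAPPED BY (c1)
)
TBLPROPERTIES ('created.by'='doc example');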
- SECURITY POLICY policy-name
- Names the security policy to be associated with the table. The policy-name
identifies a security policy that already exists.
A table is not protected unless it has a security policy associated with it and it includes a column that is defined with a security label. Using a security label indicates that the table is a protected table with row level granularity.
- COLUMN MAPPING hbase-column-mapping
-
This clause is required on all HBase tables. It defines the mapping between the columns that are defined in the element-list and the stored location in the HBase table. It also describes the encoding of values in the HBase column.
- KEY MAPPED BY
-
Defines which columns are stored in the HBase row key. When a single column is specified for the row key, the column must be defined as NOT NULL, or an error is raised (SQL?????). When multiple columns are specified for the row key, they can contain a mix of NULL and NOT NULL values. However, a row key cannot exist where all columns are null, or an error is raised (SQL?????).
Do not use an ARRAY or ROW data type as the row key.
- column-name
-
The name of a column as defined in the element-list.
- hbase-column MAPPED BY
-
Specifies the name of an HBase column into which a column-name is stored.
- hbase-column-family
-
The column family in which the target HBase column lives. Specify a valid identifier. The column family and column qualifier mappings are case sensitive. Therefore, cf:cq is not the same as cf:CQ.
- hbase-column-name
-
The column name within the hbase-column-family in which the column values are to be stored. You can specify either a string literal or an identifier.
- FORCE KEY UNIQUE
-
When you specify this parameter, a unique value is automatically generated and appended into the encoded HBase row key. This parameter allows duplicate rows to be stored within the HBase table. The generated unique value is not visible from SQL. However, the value does consume an additional 128 bytes of storage for each row in the HBase table. This parameter applies to the KEY MAPPED BY keyword only.
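A sketch of FORCE KEY UNIQUE with hypothetical names; duplicate values of k1 can be stored because a generated unique value is appended to each encoded row key:
CREATE HBASE TABLE dupkeys (
  k1 INT NOT NULL,
  c1 VARCHAR(20)
)
COLUMN MAPPING (
  KEY MAPPED BY (k1) FORCE KEY UNIQUE,
  cf1:c1 MAPPED BY (c1)
);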
- ENCODING
-
Indicates how values are encoded into the HBase column.
- delimited-row-format
-
Data is encoded as delimited text data by using the same encoding logic that is used for ROW FORMAT DELIMITED text file storage. After the data is encoded into a text value, it is converted to UTF-8 bytes during final storage.
Warning:
-
Delimited text encoding is not recommended for use in KEY mapping unless the data that is being stored is already in text format. Because the string representation of numeric values does not collate in numeric order (for example, “1” < “19” < “2”), Big SQL does not push down predicates on numeric data types that are encoded by using delimited encoding.
-
The default null definition is \N. If the separator that you choose is escaped by using the same byte that is used to escape \N, the result will not be correct. To avoid this error, choose a separator that does not clash, or use the NULL DEFINED AS clause to define a different null representation. For example, if \b is chosen as the separator, override the default null representation:
cf1:cq1 MAPPED BY (c4,c5) SEPARATOR '\b' ENCODING STRING NULL DEFINED AS '
-
- BINARY
-
Data is encoded in a binary format. The default is BINARY. BINARY is the recommended storage format, because of the following benefits:
- It performs better when encoding and decoding data.
- It ensures proper collation of values when they are used in the row key.
- It generally requires less storage than STRING encoding.
- USING serde-row-format
-
Data is encoded by using a Hive SerDe class to encode and decode the columns that are contained in the column mapping.
Warning: Columns that are encoded with a SerDe have the following restrictions:
- You cannot use these columns within a secondary index that is created by the CREATE INDEX statement.
- The search criteria on these columns cannot be pushed into HBase for processing. This means that additional rows might be returned to Big SQL to be filtered there, rather than having the region server do the row filtering. Specifically, this means that a custom encoded HBase row key results in a full HBase table scan.
- STRING
- Using STRING encoding is equivalent to using DELIMITED encoding with no other options.
- ADD SALT
-
By using this parameter, Big SQL adds a prefix to the row key. The prefix is based upon a hash function on the values that are contained within the row key. The ADD SALT parameter provides a way to attempt to scatter data that would otherwise be clustered and form “hot spots” of activity in the HBase region servers.
- BUCKETS=salt-buckets
-
Specifies the number of buckets in which to put the prefixed row keys. By default, Big SQL uses the number of region servers that exist at the time of the CREATE TABLE as the number of buckets. The maximum number of buckets that are allowed is 32.
- CLASS='class-name'
-
Specifies the name of the class that does the salting. By default, the ADD SALT option uses the Big SQL algorithm, com.ibm.biginsights.bigsql.hbasecommon.salt.BigSqlSalt.
You can use any class that extends the Big SQL salting algorithm, or you can define your own salting algorithm. Ensure that the class is available to Big SQL and that it extends com.ibm.biginsights.bigsql.hbasecommon.salt.BigSqlSalt. Use the following steps to make the class available to Big SQL:
- As the bigsql user, type the following command: $BIGSQL_HOME/bin/bigsql stop
- Add the JAR file that contains the salting class to the $BIGSQL_HOME/bigsql/userlib folder.
- Type the following command: $BIGSQL_HOME/bin/bigsql start
- PROPERTIES
- Specifies the properties that can be used to configure a custom salting class, in the form of quoted name and value pairs. For example:
PROPERTIES = ('padding'=2)
- hbase-split-options
-
When you specify these options, the HBase table is created as pre-split regions across region servers.
- SPLITS (split-point1, split-point2, ..., split-pointN)
-
Provides a starting value for the first column that is stored in the HBase row key at which a split point begins. The total number of regions that are created is the number of split keys plus one.
You can specify the values as string or numeric representations of a value of the data type of the first column that is contained in the HBase row key. You can see examples of splitting in the Examples section.
- SPLITS_FILE file-path
-
Specifies a file containing split points. The file path must be a fully qualified distributed file system (DFS) path to a file that contains split point values. In the file, there must be one value for each line. The contents of this file are treated as if those values were provided to the SPLITS clause.
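As a sketch (the table names and the DFS path are hypothetical), the following two statements are equivalent if the file /user/bigsql/splits.txt contains the lines 100, 200, and 300; each creates four regions (three split points plus one):
CREATE HBASE TABLE presplit (
  k1 INT NOT NULL, c1 INT
)
COLUMN MAPPING (
  KEY MAPPED BY (k1),
  cf1:c1 MAPPED BY (c1)
)
SPLITS (100, 200, 300);

CREATE HBASE TABLE presplit2 (
  k1 INT NOT NULL, c1 INT
)
COLUMN MAPPING (
  KEY MAPPED BY (k1),
  cf1:c1 MAPPED BY (c1)
)
SPLITS_FILE '/user/bigsql/splits.txt';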
- SPLITALGO [split-algorithm,num-region-server]
-
Use this parameter to include a split algorithm that evenly divides the space of possible keys by the value of number-of-regions. This parameter is useful when the keys are relatively uniform. SPLITALGO and NUMREGION are mutually exclusive; do not use them in the same statement. Do not use this parameter with the ADD SALT keyword.
You can select one of two algorithms that are included when you install Big SQL. Choose the algorithm that works best with your data:
- 'HexStringSplit'
-
The format of a 'HexStringSplit' region boundary is any uniformly distributed hexadecimal value. This split algorithm uses hex strings as keys, which is easy to read. However, the results can take up more space.
- 'UniformSplit'
-
By using this class, you divide the space of possible keys evenly. This algorithm is useful when the keys are nearly uniform random bytes. The results use less space, but they might not be as readable.
- NUMREGION (range-start,range-end,number-of-regions)
-
Use this parameter to include a split algorithm that evenly divides the space of possible keys by the value of number-of-regions. This parameter is useful when the keys are relatively uniform. NUMREGION and SPLITALGO are mutually exclusive; do not use them in the same statement. Do not use this parameter with the ADD SALT keyword.
You can specify the range-start and range-end values as a number or a string literal value of the first column in the HBase row key. The examples in the Examples section show the use of NUMREGION.
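A sketch of NUMREGION with hypothetical names; the key space for the first row-key column, from 0 through 1000, is divided evenly into five regions:
CREATE HBASE TABLE numsplit (
  k1 INT NOT NULL, c1 INT
)
COLUMN MAPPING (
  KEY MAPPED BY (k1),
  cf1:c1 MAPPED BY (c1)
)
NUMREGION (0, 1000, 5);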
- DEFAULT COLUMN FAMILY OPTIONS
-
This parameter specifies the HBase configuration options that are applied, by default, to all column families that are defined for the HBase table. You can define or override these options for individual column families by using the COLUMN FAMILY OPTIONS parameter.
- COLUMN FAMILY OPTIONS (OPTION NAME (option-value) ...)
-
This parameter specifies the HBase configuration options that are applied to a specific column family that is used by the HBase table. You specify the options in the form of an option name and option value pair. You can omit the option value for boolean options, which have a true or false value, to indicate a true value. The following options are available in the COLUMN FAMILY OPTIONS parameter:
- COMPRESSION
-
Use this option to compress stored files to reduce the amount of storage that is used. The default for each column family is NONE (no compression), even when no COLUMN FAMILY OPTIONS parameter is specified.
The following compression algorithms are valid. The compression ratio and performance during encoding and decoding of each algorithm depends on your environment and use case:
- NONE
-
Indicates that no compression is used when storing the column family. This value is the default.
- GZ
- Select this option to use GZIP compression.
- SNAPPY
- Select this option to use SNAPPY compression.
- LZ4
- Select this option to use LZ4 compression.
- BLOOM FILTER
-
Bloom filters help identify whether a row and column combination exists in a block without loading it. Using bloom filters uses extra space. However, using bloom filters improves look-up times and helps reduce block cache churn by loading only blocks that are required.
- NONE
- Indicates that no bloom filter is maintained. This is the default value, even when you do not specify a COLUMN FAMILY OPTIONS parameter.
- ROW
- Indicates that the hash of the row is added to the bloom on each insert. Use this option when queries have predicates on row keys.
- ROWCOL
- Indicates that the hash of the row plus column family plus column family qualifier is added to the bloom on each key insert. Use this option when queries have column projections.
- IN MEMORY
-
The possible values are false (the default) or true. A value of true indicates that data is to be kept in cache for the longest period possible to improve performance.
- NO IN MEMORY
-
The NO IN MEMORY option is retained for backwards compatibility with prior Big SQL releases. Use the IN MEMORY option instead.
- BLOCKCACHE
-
Blocks in HBase are a unit of storage for memory. The possible values for BLOCKCACHE are true or false. When you enable BLOCKCACHE, read performance can improve. For frequently accessed rows, enable BLOCKCACHE.
- BLOCKSIZE
-
You specify this value in bytes. It is used to control how much data HBase is required to read from the storage files during retrieval and what is cached in memory for subsequent accesses.
- DATA_BLOCK_ENCODING
-
This option defines in-cache key compression. By using this option, you can take advantage of the redundancy between sorted keys in an HFile block. You can choose to store only the differences between consecutive keys.
- KEEP_DELETED_CELLS
-
You can enable this option so that deleted cells can be retrieved with a GET or SCAN operation, as long as these operations specify a time range that ends before the timestamp of any delete that would affect the cells. This allows for point-in-time queries even in the presence of deletes. The possible values for KEEP_DELETED_CELLS are true or false.
- VERSIONS
-
This option is an integer value that represents the number of row versions to retain.
- MIN_VERSIONS
-
The minimum number of row versions to keep for each column family. The default value is 0, which disables the option. You can use this parameter with the time to live (TTL) parameter to keep the last TTL seconds worth of data, and a minimum of MIN_VERSIONS.
- REPLICATION_SCOPE
-
To minimize the effects of system failures, enable replication on the column family. The default value is 0, which means that the column family is not replicated. A value of 1 means that the column family is replicated.
- TTL
-
This option defines a time to live (TTL) in seconds for the column family. HBase automatically deletes all versions of the row (including the most current) that have exceeded the specified time to live.
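The default and per-family options can be combined. In this hypothetical sketch, GZ compression applies to every column family by default, and family cf2 overrides it with SNAPPY and keeps three row versions (option spellings follow the COLUMN FAMILY OPTIONS example later in this topic):
CREATE HBASE TABLE famopts (
  k1 INT NOT NULL,
  c1 VARCHAR(10),
  c2 VARCHAR(10)
)
COLUMN MAPPING (
  KEY MAPPED BY (k1),
  cf1:c1 MAPPED BY (c1),
  cf2:c2 MAPPED BY (c2)
)
DEFAULT COLUMN FAMILY OPTIONS ( COMPRESSION(GZ) )
COLUMN FAMILY OPTIONS ( cf2 COMPRESSION(SNAPPY) VERSIONS(3) );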
- HBASE TABLE NAME
-
Specifies the name of the underlying HBase table. By using this clause, you can give a different name to a table in Big SQL. HBase table names are case sensitive. For example, if you specify HBASE TABLE NAME hbasetable1, but the actual HBase table name is HBaseTable1, the statement returns an error indicating that the HBase table does not exist.
If your HBase table names are similar, such as hbasetable1 and HBaseTable1 except for the case, then you can use this clause to differentiate among them in Big SQL. For example, if your HBase table is named tab1, you can refer to that table as tab1. A second HBase table, named TAB1, can be referred to as tab1_2 in Big SQL. If tab1 and TAB1 already exist in HBase, then they can be referenced with tab1 and tab1_2 in Big SQL by using the EXTERNAL clause.
If the HBASE table name is not specified, the name of the HBase table is schema.tablename where the schema is the schema in which the table is defined in Big SQL and the table name is the name of the table as defined in Big SQL. Both names will be transformed as needed, to conform to Hive identifier rules. For more information about Identifiers in Hive, see Identifiers.
If you create an external HBase table on top of an existing HBase table by using the HBASE TABLE NAME clause, make sure that you follow these naming rules: Characters for user names and group names must be lowercase a through z, 0 through 9, and _ (underscore), for names not starting with 0 through 9.
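A sketch of the tab1/TAB1 scenario described above (the column definitions are hypothetical, and the HBase table name is shown as a string literal, which should be verified against your release):
CREATE EXTERNAL HBASE TABLE tab1 (
  k1 VARCHAR(10) NOT NULL, c1 VARCHAR(10)
)
COLUMN MAPPING (
  KEY MAPPED BY (k1),
  cf1:c1 MAPPED BY (c1)
)
HBASE TABLE NAME 'tab1';

CREATE EXTERNAL HBASE TABLE tab1_2 (
  k1 VARCHAR(10) NOT NULL, c1 VARCHAR(10)
)
COLUMN MAPPING (
  KEY MAPPED BY (k1),
  cf1:c1 MAPPED BY (c1)
)
HBASE TABLE NAME 'TAB1';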
- DEFAULT ENCODING
-
Specifies the default data encoding that is used by any column mapping that does not explicitly specify one (see the information on ENCODING). If you do not specify an encoding, the default encoding is BINARY, which is the optimal value for performance and correctness.
- (column-name) AS (full-select) WITH NO DATA
-
Creates a table that is based on a SELECT statement. The table name cannot refer to an HBase table that contains an ARRAY or ROW column (SQLSTATE 428H2).
You can optionally rename the columns in the target table.
By using the WITH NO DATA clause, you can define the target table without populating the table.
Usage notes
- When you use data types such as STRING and BINARY, you can cause the SQL processor to assume that it needs to manipulate 32K of data in a column all the time. This result can impact performance. Instead, use data types that contain a specific length. You can also set the bigsql.string.size property to default the STRING mapping to a smaller VARCHAR size. Set this property in one of two ways:
- Run the SET HADOOP PROPERTY command.
- Set the value globally by updating the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file.
- If no precision is included on the TIMESTAMP definition, the default is TIMESTAMP(9), which is compatible with a maximum precision of a TIMESTAMP in Hive.
- If you create a table in a schema that does not exist, the schema is automatically created with the default name. The default name is your User ID.
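- A sketch of setting bigsql.string.size so that STRING columns default to a smaller VARCHAR (the value 4096 is an arbitrary illustration; verify the exact quoting rules for SET HADOOP PROPERTY in your release):
SET HADOOP PROPERTY 'bigsql.string.size'=4096;
After this setting, a column declared as STRING is treated as VARCHAR(4096) when the table is created.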
Restrictions
- Do not use an ARRAY, ROW, or STRUCT data type as the row key.
- Do not use an ARRAY, ROW, or STRUCT data type in a constraint or key definition (SQLSTATE 42962).
- Do not reference a table that contains a column of type ARRAY, ROW, or STRUCT in the AS or LIKE clause (SQLSTATE 428H2).
- There is a limit of 32,763 ARRAY, ROW, or STRUCT definitions per database (SQLSTATE 54035). This limit also includes any user-defined types that are created by using the CREATE TYPE statement.
- Comparisons or assignments between ARRAY, ROW, or STRUCT variables, and the use of ARRAY or ROW data types as parameters to stored procedures, are not supported with ARRAY columns in a Hadoop table.
- Do not use an ARRAY data type as a type within an ARRAY type for Hadoop tables.
- The result type of a column in a full-select cannot be of type ARRAY.
- The use of scrollable cursors is not supported for HBase tables.
Examples
- Create an HBase table with composite keys and dense columns:
In this example, the HBase table mixed_encodings has 6 SQL columns, c1 to c6. The HBase table is created with one column family (cf1) with two columns (cq1 and cq2). The HBase row key is mapped to three SQL columns (a composite key). It uses binary encoding because the DEFAULT ENCODING clause specifies BINARY. The HBase column cf1:cq1 is mapped to two SQL columns (a dense column). It is encoded as delimited text and uses '\b' as the separator character. The HBase column cf1:cq2 is mapped to a single SQL column, c6, and is encoded as binary.
CREATE HBASE TABLE mixed_encodings (
  c1 int, c2 int, c3 int, c4 varchar(20), c5 varchar(40), c6 varchar(90)
)
COLUMN MAPPING (
  KEY MAPPED BY (c1,c2,c3),
  cf1:cq1 MAPPED BY (c4,c5) ENCODING DELIMITED FIELDS TERMINATED BY '\b',
  cf1:cq2 MAPPED BY (c6) ENCODING BINARY
)
DEFAULT ENCODING BINARY;
- Create an HBase table where all columns are in the primary key:
In this example, you can see that an HBase table can be created where all of the columns for a table are contained in the HBase primary key. No HBase columns are specified in this table creation.
CREATE HBASE TABLE allkey ( k1 VARCHAR(10), k2 VARCHAR(70),k3 VARCHAR(30) ) COLUMN MAPPING ( KEY MAPPED BY (k1, k2, k3) );
- Create an HBase table with column options:
CREATE HBASE TABLE colopt1 (
  akey VARCHAR(10), c1 VARCHAR(70), c2 VARCHAR(30)
)
COLUMN MAPPING (
  KEY MAPPED BY (akey),
  cf1:c1 MAPPED BY (c1),
  cf2:c2 MAPPED BY (c2)
)
COLUMN FAMILY OPTIONS (
  cf1 COMPRESSION(GZ) BLOOM FILTER(ROW) IN MEMORY,
  cf2 NO IN MEMORY
);
- Create an HBase table using SerDe encoding:
CREATE EXTERNAL HBASE TABLE tableName (
  c1 BIGINT, c2 INT, c3 INT
)
COLUMN MAPPING (
  KEY MAPPED BY (c1),
  cf1:cq1 MAPPED BY (c2, c3)
    ENCODING USING SERDE 'com.example.SerDeClass'
    WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2')
);
- Create an HBase table by using salting techniques:
This example shows how to use salting techniques to spread out the hot spots in a key range. New records are split into multiple buckets, and each record goes to a different region in the cluster.
CREATE HBASE TABLE HBTABLE (
  c1 BIGINT, c2 INT, c3 INT
)
COLUMN MAPPING (
  KEY MAPPED BY (c1),
  cf1:cq1 MAPPED BY (c2, c3)
)
ADD SALT (
  BUCKETS = 3
  CLASS = 'com.sample.SimpleSalt'
  PROPERTIES = ('padding' = '2')
);
This example implicitly creates four splits, which consist of the three buckets plus one.
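The salting idea can be sketched in Python: a prefix derived from the row key is prepended to the key so that sequential keys land in different buckets, and therefore in different regions. The hash-modulo scheme and the two-character zero-padded prefix below are assumptions for illustration (loosely modeled on the 'padding' = '2' property); the actual com.sample.SimpleSalt class defines its own algorithm:

```python
# Sketch of row-key salting with BUCKETS = 3 and a 2-character padded prefix.
# The CRC32-modulo bucket choice is an assumption; a real salt class may differ.
import zlib

BUCKETS = 3
PADDING = 2  # width of the salt prefix, modeled on the 'padding' = '2' property

def salted_key(row_key: str) -> str:
    bucket = zlib.crc32(row_key.encode()) % BUCKETS  # deterministic bucket choice
    return f"{bucket:0{PADDING}d}-{row_key}"

# Sequential keys no longer sort next to each other once salted.
for key in ["1001", "1002", "1003"]:
    print(salted_key(key))
```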
- Create an HBase table with default salt options:
This example defines a table with the salt option and no parameters. Big SQL uses the default salting algorithm, which sets the number of buckets to the number of region servers that exist at the time the table is created.
CREATE HBASE TABLE mysalt1 ( rowkey INT,c0 INT ) COLUMN MAPPING ( KEY MAPPED BY (rowkey), f:q MAPPED BY (c0) ) ADD SALT;
- Create an HBase table with a combination of salting and splitting:
This example uses a combination of salting and explicit splitting to produce seven splits (the number of explicit splits plus one). You must know your data to use the split option. Big SQL does not validate your split information.
CREATE HBASE TABLE salt_and_explicit_split ( rowkey VARCHAR(40),c0 INT ) COLUMN MAPPING ( KEY MAPPED BY (rowkey), f:q MAPPED BY (c0) ) ADD SALT ( buckets=3 ) SPLITS ( '0a', '0n', '1a', '1n', '2a', '2n' );
- Create an HBase table by using a compound row key:
This example shows a compound row key with numeric values and BINARY default encoding:
CREATE HBASE TABLE hbtable (
  c1 INT, c2 INT, c3 INT,
  c4 VARCHAR(10), c5 VARCHAR(20), c6 VARCHAR(30)
)
COLUMN MAPPING (
  KEY MAPPED BY (c1, c2, c3),
  cf1:cq1 MAPPED BY (c4, c5) ENCODING DELIMITED FIELDS TERMINATED BY '\b',
  cf1:cq2 MAPPED BY (c6) ENCODING BINARY
)
DEFAULT ENCODING BINARY
NUMREGION (10, 30, 5);
The table is created with splits on C1. The NUMREGION parameter defines the starting point of the range as 10 and the ending point of the range as 30, with 5 regions in which to split. The regions for the hbtable table are shown in the following table:

Table 1. Example split regions

| Region | Split on C1 values |
|--------|--------------------|
| 1      | Less than 10       |
| 2      | 10 - 15            |
| 3      | 16 - 21            |
| 4      | 22 - 29            |
| 5      | 30 and greater     |

Be aware that if C1 in the hbtable table contains character or string values, the split might not be as efficient when the ENCODING is BINARY. Match the encoding with the value type of the row key column for more efficient splits.

Table 2. Guidelines for splitting and using default encoding

| Row key value    | Default encoding |
|------------------|------------------|
| Numeric values   | BINARY ENCODING  |
| Character values | STRING ENCODING  |
| String values    | STRING ENCODING  |

- Create an HBase table by using a custom SerDe:
The following example includes a custom SerDe. By using this SerDe, you can read in data that is stored as JSON objects and convert that data to a JSON string that can be understood by HBase. The actual content of your HBase table is from JSON documents:
CREATE HBASE TABLE post (
  id INT, title STRING, content STRING,
  created_on TIMESTAMP, updated_on TIMESTAMP,
  published BOOLEAN
)
COLUMN MAPPING (
  KEY MAPPED BY (id),
  cf1:cq1 MAPPED BY (title, content, created_on, updated_on, published)
)
DEFAULT ENCODING USING SERDE 'com.ibm.biginsights.bigsql.serde.JSONSerDe';

The example insert is:

insert into post values(1, 'Hello world!', 'First post!', '2013-01-01 01:01:01.001', '2013-01-01 01:01:01.001', true);
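On the write path, a JSON SerDe serializes the non-key columns into one JSON string that becomes the HBase cell value. The Python sketch below models that idea for the inserted row; the field names and JSON layout are assumptions for illustration, not the documented wire format of com.ibm.biginsights.bigsql.serde.JSONSerDe:

```python
# Illustrative model of JSON SerDe encoding for the row inserted above.
# The JSON layout is an assumption, not the SerDe's documented format.
import json

row = {
    "title": "Hello world!",
    "content": "First post!",
    "created_on": "2013-01-01 01:01:01.001",
    "updated_on": "2013-01-01 01:01:01.001",
    "published": True,
}

cell_value = json.dumps(row)      # the value stored under cf1:cq1
decoded = json.loads(cell_value)  # what a read would reconstruct
print(decoded["title"])  # Hello world!
```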
- Splitting on string or numeric representations:
CREATE HBASE TABLE T1 ( C1 INT, C2 VARCHAR(10), C3 VARCHAR(20) ) COLUMN MAPPING ( KEY MAPPED BY (C1,C2), cf1:c1 MAPPED BY (C3) ) SPLITS (1000, 2000, 3000) ;
CREATE HBASE TABLE T1 ( C1 DATE, C2 VARCHAR(10), C3 VARCHAR(20) ) COLUMN MAPPING ( KEY MAPPED BY (C1,C2), cf1:c1 MAPPED BY (C3) ) SPLITS ('1970-01-01', '2000-01-01', '2010-01-01') ;
- Splitting on NUMREGION:
CREATE HBASE TABLE T1 ( C1 INT, C2 VARCHAR(10), C3 VARCHAR(20) ) COLUMN MAPPING ( KEY MAPPED BY (C1,C2), cf1:c1 MAPPED BY (C3) ) NUMREGION (0, 10000, 20) ;
CREATE HBASE TABLE T1 ( C1 DATE, C2 VARCHAR(10), C3 VARCHAR(20) ) COLUMN MAPPING ( KEY MAPPED BY (C1,C2), cf1:c1 MAPPED BY (C3) ) NUMREGION ('1970-01-01', '2100-01-01', 20) ;
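To make the boundary arithmetic concrete, the split keys that NUMREGION implies can be sketched in Python. The formula below is an assumption (the exact algorithm that Big SQL uses is not documented here), chosen because it reproduces the regions shown in Table 1 for NUMREGION(10, 30, 5), where the split keys are 10, 16, 22, and 30:

```python
# Hypothetical reconstruction of the split keys implied by NUMREGION(start, end, n).
# With n total regions, the first region holds keys below start and the last holds
# keys at or above end, so n - 2 regions are carved out of [start, end).

def numregion_splits(start: int, end: int, n: int) -> list[int]:
    step = (end - start) // (n - 2)  # assumed width of each interior region
    return [start + i * step for i in range(n - 2)] + [end]

print(numregion_splits(10, 30, 5))  # [10, 16, 22, 30]
```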
- Create an HBase table as a copy of another table:
CREATE HBASE TABLE hb1 ... AS SELECT * FROM hbcopy ;
- This example shows you how to create a table that uses complex data types such as STRUCT and ARRAY.
CREATE HBASE TABLE complex_types (
  key VARCHAR(5),
  c1 STRUCT<f1:varchar(5), f2:varchar(5)>,
  c2 ARRAY<varchar(5)>
)
COLUMN MAPPING (
  KEY MAPPED BY (key),
  f:q1 MAPPED BY (c1) ENCODING DELIMITED FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '/' LINES TERMINATED BY '\n',
  f:q2 MAPPED BY (c2) ENCODING DELIMITED FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '/' LINES TERMINATED BY '\n'
);
Suppose that the following sample data is stored in HDFS, in a file named data.csv in the folder /tmp:

key1,f1v1/f2v1,a1v1/a2v1/a3v1
key2,f1v2/f2v2,a1v2/a2v2/a3v2
key3,f1v3/f2v3,a1v3/a2v3/a3v3

You can use the LOAD command to populate the table with this data, as shown in the following example:

load hadoop using file url '/tmp/data.csv' with source properties ('field.delimiter'=',','collection.items.delimiter'='/') into table complex_types;

After the data is loaded, you can query the complex types as individual elements:

select key, c1.f1, c1.f2, c2[1], c2[2], c2[3] from complex_types;
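The delimited encoding above can be mimicked in Python: ',' separates the top-level fields and '/' separates collection items, so one line of data.csv splits into the key, the STRUCT fields, and the ARRAY elements. This parser is only a sketch of the encoding rules from the ENCODING clauses, not Big SQL code:

```python
# Parse one data.csv line using the delimiters from the ENCODING clauses:
# ',' for fields, '/' for collection items (STRUCT fields and ARRAY elements).

def parse_line(line: str) -> dict:
    key, c1_raw, c2_raw = line.split(",")
    f1, f2 = c1_raw.split("/")  # STRUCT<f1, f2>
    c2 = c2_raw.split("/")      # ARRAY elements
    return {"key": key, "c1": {"f1": f1, "f2": f2}, "c2": c2}

row = parse_line("key1,f1v1/f2v1,a1v1/a2v1/a3v1")
print(row["c1"]["f2"], row["c2"][0])  # f2v1 a1v1
```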