CREATE TABLE (HBASE) statement
The CREATE TABLE (HBASE) statement defines an HBase table for the Hadoop environment. The definition must include its name and the names and attributes of its columns. The definition can include other attributes of the table, such as its primary key or check constraints.
This statement can be embedded in an application program or issued by using dynamic SQL statements. It is an executable statement that can be dynamically prepared only if DYNAMICRULES run behavior is in effect for the package (SQLSTATE 42509).
- One of the following privileges or authorities:
- USE privilege on the table space
- SYSADM authority
- SYSCTRL authority
- Plus one of these privileges or authorities:
- IMPLICIT_SCHEMA authority on the database, if the implicit or explicit schema name of the table does not exist
- CREATEIN privilege on the schema, if the schema name of the table refers to an existing schema
- SCHEMAADM authority on the schema, if the schema name of the table refers to an existing schema
- REFERENCES privilege on the table
- REFERENCES privilege on each column of the specified parent key
- CONTROL privilege on the table
- SCHEMAADM authority on the schema, if the schema name of the parent table refers to an existing schema
- DBADM authority
- 1 Each clause can be used only once.
- 2 Db2 Big SQL supports the data types that are described in Data types. However, some data types are syntactically supported for compatibility with Hive but are treated as different data types within the Db2 Big SQL runtime environment. For more information, see the mapping of data types that are used in the CREATE TABLE statement.
- 3 Because the STRING data type has no specific size associated with it, the SQL processor might assume that each value in a STRING column always contains 32 KB of data, and performance might be impacted. If performance is a concern, use types that contain a specific length. You can also set the bigsql.string.size property to default the STRING mapping to a smaller VARCHAR size. Set this property in one of two ways: run the SET HADOOP PROPERTY command, or set the value globally by updating the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file.
- 4 Do not use the SPLITALGO or the NUMREGION clause with the ADD SALT clause. SPLITALGO and NUMREGION are mutually exclusive.
- Indicates that the table is already defined in HBase. In this case, Db2 Big SQL refers to that existing table. If the EXTERNAL keyword is not specified, the HBase table is automatically created by Db2 Big SQL. When Db2 Big SQL refers to an external table, it does not verify that the structure of the external table matches the table definition in Db2 Big SQL until a running query references the table.
- This keyword is required when you are defining an HBase table. You can verify that you have
defined an HBase table by querying the PROPERTY column in the SYSCAT.TABLES catalog view. For
SELECT SUBSTRING(PROPERTY,22,1) FROM SYSCAT.TABLES WHERE TABNAME='<table-name>';
- IF NOT EXISTS
- Checks whether the specified table name already exists. If the name exists, no error is returned.
- Specifies a unique name for the new table. The table name must be a valid identifier. The table name cannot refer to an HBase table that contains an ARRAY or ROW column (SQLSTATE 428H2).
- Defines the elements of a table, including the definition of columns and constraints on the table.
- Defines the attributes of a column.
- Names a column of the table. The name cannot be qualified, and the same name cannot be used for more than one column of the table. The column name must be a valid identifier.
- Specifies the data type of the column.
- There are several built-in data types:
- A 1-byte integer with a range of -128 to 127.
Db2® Big SQL automatically converts the TINYINT to a larger integer type, such as SMALLINT, INT, or BIGINT, or a floating-point type such as FLOAT or DOUBLE.
- SMALLINT | INT2
- A 2-byte integer with a precision of 5 digits and a range of -32,768 to 32,767.
- INTEGER | INT | INT4
- A 4-byte integer with a precision of 10 digits and a range of -2,147,483,648 to +2,147,483,647.
- BIGINT | INT8
- An 8-byte integer with a precision of 19 digits and a range of -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807.
- FLOAT(integer) | FLOAT 8 | REAL | FLOAT4 | DOUBLE | FLOAT8
- A single-precision or double-precision floating-point number.
A single-precision floating-point number is a 32-bit approximation of a real number. The number can be zero or can range from -3.4028234663852886e+38 to -1.1754943508222875e-38, or from 1.1754943508222875e-38 to 3.4028234663852886e+38.
A double-precision floating-point number is a 64-bit approximation of a real number. The number can be zero or can range from -1.7976931348623158e+308 to -2.2250738585072014e-308, or from 2.2250738585072014e-308 to 1.7976931348623158e+308.
A REAL number is a single-precision floating-point number. A DOUBLE number is a double-precision floating-point number.
- CHARACTER(integer) | CHAR(integer)
- A character string is a sequence of characters. The length of the string (integer) is the number of characters in the sequence. The default is 1.
- STRING | VARCHAR(integer) | CHARACTER VARYING(integer) | CHAR VARYING(integer)
- The STRING data type has no specific size associated with it. The default is VARCHAR(32672).
- TIMESTAMP | DATETIME
- A six- or seven-part value (year, month, day, hour, minute, second, and optional fractional
seconds) that designates a date and time.
If specified, integer must be between 0 and 9 and represents the precision of fractional seconds from 0 (seconds) to 9 (nanoseconds).
- A three-part value (year, month, and day) that designates a date. The range of the year part is 0001 to 9999. The range of the month part is 1 to 12. The range of the day part is 1 to x, where x depends on the month.
- DECIMAL | DEC | NUMERIC | NUM (precision-integer, scale-integer)
- A decimal number with defined precision and scale. The precision is the total number of digits and can range from 1 to 31. The scale is the number of digits to the right of the decimal point and can range from 0 to the precision of the number. The default precision is 10 and scale is 0.
- BINARY | VARBINARY (integer)
- A varying length binary value is specified as VARBINARY(length) or
BINARY(length) and can be up to 32,672 bytes long. VARBINARY is an alias for
BINARY, and they are functionally identical in every way. If no length is provided, the length is
assumed to be 32,672.
However, it is recommended that an explicit length be specified. The Db2 Big SQL engine works most efficiently when the table definition enables a row to fit within 32 KB of memory. When the calculated row size exceeds 32 KB, some queries might see a performance degradation.
With this data type, the Java™ I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Db2 Big SQL.
- A Boolean value.
- array-type | row-type
- For detailed information about these types, see Array values and Row values.
Each ARRAY or ROW type can be used in one of two ways: the SQL standard version or the Hive version. The SQL standard version, which is shown in the syntax diagram, is the recommended usage. When you specify the number of elements in the array as part of the table definition, the database manager can compute the maximum size of the column, which enhances query optimization.
With this data type, the Java I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Db2 Big SQL.
- An ARRAY is a list that can be indexed. The elements in the array must be of the same type.
Assuming that column C1 is declared as an ARRAY with two or more elements, you can access these
elements by using the following notation:
...WHERE C1=1.... An ARRAY of ROW types is valid.
- For simple arrays, an integer constant that indicates the scale of the array, which is the number of elements in the array.
- INT | INTEGER | VARCHAR (int)
- The data type of the associative array index.
- A ROW object contains a value for each attribute of the SQL structured type that it represents
(java.sql.Struct). The ROW type is equivalent to the STRUCT type in Hive tables. The ROW type
contains field definitions that contain field names and their data types. An entry is created in the
SYSIBM.SYSATTRIBUTES table for each field in a ROW definition.
- Specifies the name of a field for the ROW type. The name cannot be the same as another field of this ROW type (SQLSTATE 42711).
- Specifies the data type of the field. The data type must be a primitive data type.
- You can define column options including a comment, the nullability attribute, or constraints.
- COMMENT column-comment
- A comment that provides additional information about the column.
- NULL | NOT NULL
- NOT NULL specifies that the column will not contain null values. Because the Hadoop table is external to any local database server, Big SQL enforces nullability at query execution time (SQL5104N). The nullability attribute of a column is enforced during an INSERT or LOAD USING operation. Do not assign a null value to a column that is defined as NOT NULL (SQLSTATE 23502).
- CONSTRAINT constraint-name
- Specifies the name of a constraint. You can define the following constraints on a column:
- PRIMARY KEY
- A primary key must be unique and must have the NOT NULL attribute. A table cannot have more than one primary key.
- A unique key prevents duplicate values in one or more NOT NULL columns of a table. A table can have multiple unique keys.
- REFERENCES table-name
- Ensures that required table relationships are maintained and that data entry rules are followed. The column on which this parameter is specified becomes the foreign key to the referenced parent table table-name.
- Defines attributes that are associated with primary key, unique, referential integrity, or check constraints.
- NOT ENFORCED | ENFORCED
- Specifies whether the constraint is enforced by the database manager during normal operations
such as insert, update, or delete.
- NOT ENFORCED
- Specifies that the constraint is not enforced by the database manager. This is the default for
Hadoop tables. A primary key or unique constraint cannot be NOT ENFORCED if there is a dependent
ENFORCED referential constraint.
- TRUSTED | NOT TRUSTED
- Specifies whether the data can be trusted to conform to the constraint.
- Specifies that the data can be trusted to conform to the constraint. This is the default. Use this option only if the table data is independently known to conform to the constraint; otherwise, unpredictable results might occur.
- NOT TRUSTED
- Specifies that the data cannot be trusted to conform to the constraint. This option is valid only for referential integrity constraints (SQLSTATE 42613). Use this option when the data conforms to the constraint for most rows, but it is not independently known whether all current and future rows will conform. If a constraint is both not trusted and enabled for query optimization, it will not be used for any optimization that depends on the data being in complete conformity with the constraint.
- Although the ENFORCED keyword is available for compatibility with the syntax for local tables that are created with the CREATE TABLE statement, an error is returned when you specify this option for Hadoop tables (SQLSTATE 42858).
- ENABLE QUERY OPTIMIZATION | DISABLE QUERY OPTIMIZATION
- Specifies whether the constraint or functional dependency can be used for query optimization
under appropriate circumstances.
- ENABLE QUERY OPTIMIZATION
- Specifies that the constraint is assumed to be true and can be used for query optimization. This is the default.
- DISABLE QUERY OPTIMIZATION
- Specifies that the constraint cannot be used for query optimization. This option cannot be specified for primary key and unique constraints (SQLSTATE 42613).
- Defines a unique key. This constraint does not allow duplicate values in one or more columns of the table. A table can have multiple unique keys. The columns that are specified in a unique constraint must be defined as NOT NULL.
- Defines a referential constraint.
- CONSTRAINT constraint-name
- Specifies a name for the referential constraint.
- FOREIGN KEY (column-name, ...)
- The foreign key of the referential constraint is composed of the identified columns. Each name in the list of column names must identify a column of the parent table, and the same column must not be identified more than once. The number of foreign key columns must be the same as the number of parent key columns, and the data types of the corresponding columns must be compatible (SQLSTATE 42830).
- COMMENT string-constant
- Specifies a comment that provides additional information about the table. The maximum length of string-constant is 254 bytes.
- TBLPROPERTIES ('string-constant1' = 'string-constant2')
- Defines job properties that can configure input or output formats, and that you can use to
specify more detailed information about the table. The properties are passed
as isto the underlying InputFormat and OutputFormat classes. Properties that are not implemented by these classes are ignored. Db2 Big SQL does not validate the properties.
- COLUMN MAPPING hbase-column-mapping
- This clause is required for all HBase tables. It specifies the mapping between columns that are
defined in element-list and their stored locations. It also describes the
encoding of values in HBase columns.
- KEY MAPPED BY column-name
- Specifies one or more columns that are to be stored in the HBase row key. If a single column is specified, that column must be defined as NOT NULL. If multiple columns are specified, some, but not all of those columns can be defined as nullable. Do not specify array-type or row-type columns.
- FORCE KEY UNIQUE
- Specifies that a unique value is to be automatically generated and appended to the encoded HBase row key. This clause enables duplicate rows to be stored in an HBase table. The generated value is not visible through SQL, but does consume an additional 128 bytes of storage for each row in the HBase table. This clause is effective only when the KEY MAPPED BY clause is specified.
- Specifies how values are encoded in HBase columns.
- Data is encoded in a binary format. This is the default. Binary encoding has the following benefits:
- It performs better during data encoding and decoding.
- It ensures the proper collation of values when they are used in a row key.
- It generally requires less storage than string encoding.
- Data is encoded as delimited text, after which it is converted into UTF-8 during final storage.
- Valid values for the FIELDS TERMINATED BY, ESCAPED BY, and LINES TERMINATED BY parameters are
the following items:
- A literal single-byte character such as pipe (|) or comma (,).
- A value of the format
\nnn, where nnn is the octal value of a character. For example, \001 is the ASCII character that represents CTRL-A.
- A value of the format
\xYY, where YY is a two-digit hexadecimal value of a character. This representation is retained for compatibility with Db2 Big SQL v1; however, the literal character or octal representations are recommended.
- FIELDS TERMINATED BY terminator-char
- Specifies a delimiter for structured fields. You can include any single character, but the default is \001, the ASCII character that represents CTRL-A. Characters that match the field terminator character can be part of your data if you specify the ESCAPED BY clause and those characters are properly escaped.
- COLLECTION ITEMS TERMINATED BY terminator-char
- Specifies a delimiter for the individual items in a collection. You can include any single character, but the default is \002, the ASCII character that represents CTRL-B.
- LINES TERMINATED BY terminator-char
- Specifies a delimiter for data rows. The only valid delimiter is a newline character (\n). Newline characters must not exist within your data because they cannot be escaped.
- NULL DEFINED AS null-value
- Specifies a value to represent the null value. The default is a literal backslash character followed by an uppercase N (\N).
- Delimited text encoding is not recommended for use in key mapping unless the data that is being
stored is already in text format. Because the string representation of numeric values does not
collate (for example,
'1' < '19' < '2'), Db2 Big SQL does not push down predicates on numeric data types that were encoded as delimited text.
- Because the default null value representation is \N, if the delimiter you
choose will be escaped by the same byte that is used to escape \N, the result will
not be correct. To avoid this scenario, choose a different delimiter, or use the NULL DEFINED AS
clause to specify a different byte to represent null values. For example, if \b is
the delimiter, override the default null value representation, as shown in the following
cf1:cq1 MAPPED BY (c4,c5) SEPARATOR '\b' ENCODING STRING NULL DEFINED AS ''
- USING serde-row-format
- Data is encoded by using a Hive SerDe class to encode and decode the columns that are contained
in the column mapping.Warning:
- You cannot use these columns in a secondary index that is created by the CREATE INDEX statement.
- Search criteria on these columns cannot be pushed into HBase for processing. This means that rows might be returned to Db2 Big SQL for filtering.
- SERDE serde-class
- Specifies the name of a Java class in the Db2 Big SQL server CLASSPATH that implements the Hive SerDe interface org.apache.hadoop.hive.serde2.SerDe.
- WITH SERDEPROPERTIES ('string-constant1' = 'string-constant2')
- Associates SerDe properties with the SerDe class. For example, if you want to eventually load HBase data into a Hive table, create the table by using this option and the hbase.columns.mapping parameter.
- Using string encoding is equivalent to using delimited encoding with no other options.
- Specifies the case-sensitive qualified name of an HBase column in which values are to be stored.
The format of this name is
- Specifies the column family that qualifies a target HBase column name. A column family is an internal storage column definition under which the table columns are stored. This value must be a valid identifier.
- Specifies a name that identifies a column that belongs to hbase-column-family. You can specify a string literal or a valid identifier.
- MAPPED BY (column-name, ...)
- Specifies one or more SQL columns that are mapped to an HBase column.
- ADD SALT
- Specifies that Db2 Big SQL is to add a prefix to the row key. This prefix is based on a hash
function that is applied to values in the row key. This clause provides a way to scatter data that
would otherwise be clustered and form hot spots of activity in the HBase region servers.
- Specifies the number of buckets in which to put the prefixed row keys. The default value is the number of region servers that exist when the statement runs. The maximum value is 64.
Specifies the name of the class that does the salting. The default is com.ibm.biginsights.bigsql.hbasecommon.salt.BigSqlSalt.You can specify any class that extends the default algorithm, or you can define your own salting algorithm. Ensure that the class is available to Db2 Big SQL and that it extends com.ibm.biginsights.bigsql.hbasecommon.salt.BigSqlSalt. Complete the following steps to make the class available to Db2 Big SQL:
- As the bigsql user, run the following command to stop Db2 Big SQL:
- Add the JAR file that contains the salting class to the $BIGSQL_HOME/bigsql/userlib directory.
- Run the following command to restart Db2 Big SQL:
- As the bigsql user, run the following command to stop Db2 Big SQL:
- Specifies the properties that can be used to configure a custom salting class. The properties
are specified in the form of name/value pairs in which the name is delimited by single quotation
marks. For example:
properties = ('padding'=2)
- Specifies that the HBase table is to be created as pre-split regions across region servers.
- SPLITS (split-point1, split-point2, ..., split-pointN)
- Provides a starting value for the first column that is stored in the HBase row key at which a
split point begins. The total number of created regions is the number of split points plus
You can specify the values as string or numeric representations of a value whose data type matches that of the first column in the HBase row key. The Examples section shows how to use the SPLITS clause.
- SPLITALGO [split-algorithm,num-region-servers]
- Specifies a split algorithm that evenly divides the number of possible keys by the number of
regions. This clause is useful when the keys are relatively uniform. Do not use NUMREGION with
SPLITALGO in the same statement. Do not use ADD SALT with SPLITALGO.
You can select one of two algorithms that are included when you install Db2 Big SQL. Choose the algorithm that works best with your data.
- Specifies a region boundary as a uniformly distributed hexadecimal value. This split algorithm uses hex strings as keys, which are easy to read but can consume more space.
- By using this class, you divide the space of possible keys evenly. This algorithm is useful when the keys are nearly uniform random bytes. The results use less space, but they might not be as readable.
- NUMREGION (range-start,range-end,number-of-regions)
- Specifies a split algorithm that evenly divides the number of possible keys by the number of
regions. This clause is useful when the keys are relatively uniform. Do not use SPLITALGO with
NUMREGION in the same statement. Do not use ADD SALT with NUMREGION.
You can specify the range start and range end values as a number or a string literal value of the first column in the HBase row key. The Examples section shows how to use the NUMREGION clause.
- SPLITS_FILE filepath
- Specifies a file that contains split point values, one value per line. The path must be a fully qualified distributed file system (DFS) path. The contents of this file are processed as though the SPLITS clause were specified.
- DEFAULT COLUMN FAMILY OPTIONS
- This parameter specifies the HBase configuration options that are applied, by default, to all column families that are defined for the HBase table. You can define or override these options for individual column families by using the COLUMN FAMILY OPTIONS clause.
- COLUMN FAMILY OPTIONS (hbase-cfamily hbase-cfamily-opts)
- Specifies the HBase configuration options that are applied to a specific column family. You
specify the options in the form of option name/value pairs. In the case of a Boolean option, you can
omit the option value to specify true.
- Specifies a compression algorithm for the column family. Valid values are NONE (default), GZ (GZIP), SNAPPY, and LZ4.
- BLOOM FILTER
- Bloom filters help to identify whether a specific row and column combination exists in a block
without having to load it. Bloom filters use extra space but can improve lookup times and help to
reduce block cache churn by loading only the blocks that are required.
- Specifies that no bloom filter is maintained. This is the default.
- Specifies that the hash of the row is added to the bloom filter on each key insert. Use this option when queries have predicates on row keys.
- Specifies that the hash of the row plus column family plus column family qualifier is added to the bloom filter on each key insert. Use this option when queries have column projections.
- IN MEMORY
- Specifies whether data is to be cached for as long as possible to improve performance. Valid values are true and false (default).
- NO IN MEMORY
- This clause is retained for compatibility with prior releases. Use the IN MEMORY clause instead.
- Specifies whether storage memory blocks are to be cached. Block caching can improve read performance. Valid values are true and false. If block caching is enabled and some blocks have to be evicted to free up space, the column family blocks for whom IN MEMORY is set to true are the last blocks to be evicted.
- Specifies the memory block size (in bytes). This value is used to control how much data HBase is required to read from storage files during retrieval and what is cached in memory for subsequent access. The default is 64 KB.
- Specifies in-cache key compression, which enables you to take advantage of sorted key redundancy in an HFile block. You can choose to store only the differences between consecutive keys.
- Specifies that deleted cells can be retrieved with a GET or SCAN operation, as long as these operations have a specified time range that ends before the deletion timestamp. This allows for point-in-time queries even in the presence of delete operations. Valid values are true and false. The default is false.
- An integer value that specifies the number of different versions of column values to store in the HBase table. The default is 1. Db2 Big SQL reads only the latest version of a column value.
- Specifies the minimum number of row versions to keep for each column family. The default is 0, which disables the option. If the TTL clause is set to a specific value, the data is deleted after the TTL value is passed. You can use the MIN_VERSIONS clause if you want to retain at least a few versions of the data.
- Specifies whether replication is to be enabled on the column family to minimize the effects of system failure. The default is 0, which means that the column family is not replicated. A value of 1 means that the column family is replicated.
- Specifies a time to live (TTL) value (in seconds) for the column family. HBase automatically deletes all versions of the row (including the current version) that have exceeded the specified TTL value. The default is forever.
- HBASE TABLE NAME table-name
- Specifies the name of the underlying HBase table. By using this clause, you can give a different
name to a table in Db2 Big SQL. HBase table names are case sensitive. For example, if you specify
HBASE TABLE NAME hbasetable1, but the actual HBase table name is HBaseTable1, the statement returns an error message because the specified HBase table does not exist.
If your HBase table names differ only in case, you can use this clause to differentiate them in Db2 Big SQL. For example, if your HBase table is named tab1, you can refer to that table as tab1 in Db2 Big SQL. A second HBase table named TAB1 can be referred to as tab1_2 in Db2 Big SQL by using the EXTERNAL clause.
If the HBASE TABLE NAME clause is not specified, the name of the HBase table is schema.tablename, where the schema and the table name are defined in Db2 Big SQL. These names are transformed, if necessary, to satisfy Hive rules for identifiers. For more information about identifiers in Hive, see Identifiers.
If you create an external HBase table on top of an existing HBase table by using the HBASE TABLE NAME clause, ensure that characters for user names and group names are from the set of lowercase characters a through z, the numbers 0 through 9, and the underscore character ( _ ). Names cannot begin with a number.
- DEFAULT ENCODING hbase-encoding
- Specifies the default data encoding that is to be used by any column mapping that does not explicitly specify one (see ENCODING). The default is binary encoding.
- Creates a table that is based on a SELECT statement.
- (column-name1, column-name2, ...)
- You can optionally rename the columns in the target table.
- AS (full-select)
- Specifies the SELECT statement on which the target table is based.
- WITH NO DATA
- Specifies that the target table is to be created without data.
- If you create a table in a schema that does not exist, the schema is automatically created with the default name, which is your user ID.
- Because the STRING data type has no specific size associated with it, the SQL processor might assume that each value in a STRING column always contains 32 KB of data, and performance might be impacted. If performance is a concern, use types that contain a specific length. You can also set the bigsql.string.size property to default the STRING mapping to a smaller VARCHAR size. Set this property in one of two ways: run the SET HADOOP PROPERTY command, or set the value globally by updating the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file.
- If no precision is included on the TIMESTAMP definition, the default is TIMESTAMP(9), which is compatible with the maximum precision of a TIMESTAMP value in Hive.
If only the /apps/hbase/data directory is encrypted, INSERT...SELECT or LOAD INTO HBASE does not work.
- Do not use an ARRAY, ROW, or STRUCT data type value as the row key.
- Do not use an ARRAY, ROW, or STRUCT data type value in a constraint or key definition (SQLTATE 42962).
- Do not reference a table that contains a column of type ARRAY, ROW, or STRUCT in the AS clause (SQLSTATE 428H2).
- There is a limit of 32,763 ARRAY, ROW, or STRUCT definitions per database (SQLSTATE 54035). This limit also includes any user-defined types that use the CREATE TYPE statement.
- Comparisons or assignments among ARRAY, ROW, or STRUCT variables, or that use ARRAY data types as parameters, or ARRAY or ROW parameters to a stored procedure, are not supported with ARRAY columns.
- Do not use an ARRAY data type as a type within an ARRAY type.
- The result type of a column in a fullselect cannot be an ARRAY type.
- Columns that are defined with the ARRAY or ROW data type cannot be used as the anchor object of an anchored type (SQLSTATE 428HS).
The use of scrollable cursors for HBase tables is not supported.
- Create an HBase table with composite keys and dense columns. In this example, the HBase table
has one column family (cf1) with two columns (cq1 and cq2). The HBase row key is mapped to three SQL
columns (composite key). The encoding is binary. The HBase column cf1:cq1 is mapped to two SQL
columns (dense column). The encoding in this case is string with '\b' as the terminator character.
The HBase column cf1:cq2 is mapped to a single SQL column. The encoding is
CREATE HBASE TABLE mixed_encodings ( c1 INT, c2 INT, c3 INT, c4 VARCHAR(20), c5 VARCHAR(40), c6 VARCHAR(90) ) COLUMN MAPPING ( KEY MAPPED BY (c1, c2, c3), cf1:cq1 MAPPED BY (c4, c5) ENCODING DELIMITED FIELDS TERMINATED BY '\b', cf1:cq2 MAPPED BY (c6) ENCODING BINARY ) DEFAULT ENCODING BINARY;
- Create an HBase table whose columns are all part of the primary
CREATE HBASE TABLE allkey ( k1 VARCHAR(10), k2 VARCHAR(70), k3 VARCHAR(30) ) COLUMN MAPPING ( KEY MAPPED BY (k1, k2, k3) );
- Create an HBase table with column
CREATE HBASE TABLE colopt1 ( akey VARCHAR(10), c1 VARCHAR(70), c2 VARCHAR(30) ) COLUMN MAPPING ( KEY MAPPED BY (akey), cf1:c1 MAPPED BY (c1), cf2:c2 MAPPED BY (c2) ) COLUMN FAMILY OPTIONS ( cf1 COMPRESSION(GZ) BLOOM FILTER(ROW) IN MEMORY, cf2 NO IN MEMORY );
- Create an HBase table by using SerDe
CREATE EXTERNAL HBASE TABLE tableName ( c1 BIGINT, c2 INT, c3 INT ) COLUMN MAPPING ( KEY MAPPED BY (c1), cf1:cq1 MAPPED BY (c2, c3) ENCODING USING SERDE 'com.example.SerDeClass' WITH SERDEPROPERTIES ('key1'='value1','key2'='value2') );
- Create an HBase table by using salting techniques to spread out any hot spots in the range. New
records are split into multiple buckets, and each record goes to a different region in the cluster.
This example implicitly creates four splits, which consist of the three buckets plus
CREATE HBASE TABLE hbtable ( c1 BIGINT, c2 INT, c3 INT ) COLUMN MAPPING ( KEY MAPPED BY (c1), cf1:cq1 MAPPED BY (c2, c3) ) ADD SALT ( (BUCKETS = 3) (CLASS = 'com.sample.SimpleSalt') PROPERTIES = ('padding' = '2') );
- Create an HBase table with default salt options. Db2 Big SQL uses the default salting algorithm,
which defines the number of buckets as the number of region servers that exist at the time the table
CREATE HBASE TABLE mysalt1 ( rowkey INT, c0 INT ) COLUMN MAPPING ( KEY MAPPED BY (rowkey), f:q MAPPED BY (c0) ) ADD SALT;
- Create an HBase table with a combination of salting and splitting to produce seven splits (the
number of explicit splits plus one). You must know your data to use the split option. Db2 Big SQL
does not validate your split
CREATE HBASE TABLE salt_and_explicit_split ( rowkey VARCHAR(40), c0 INT ) COLUMN MAPPING ( KEY MAPPED BY (rowkey), f:q MAPPED BY (c0) ) ADD SALT ( (buckets=3) ) SPLITS ( '0a', '0n', '1a', '1n', '2a', '2n');
- Create an HBase table by using a compound row key with a numeric value and default binary
CREATE HBASE TABLE hbtable ( c1 INT, c2 INT, c3 INT, c4 VARCHAR(10), c5 VARCHAR(20), c6 VARCHAR(30) ) COLUMN MAPPING ( KEY MAPPED BY (c1, c2, c3), cf1:cq1 MAPPED BY (c4, c5) ENCODING DELIMITED FIELDS TERMINATED BY '\b' cf1:cq2 MAPPED BY (c6) ENCODING BINARY ) DEFAULT ENCODING BINARY NUMREGION(10, 30, 5);The table is created with splits on C1. The NUMREGION clause defines the starting point of the range as 10, and the ending point of the range as 30, with 5 regions in which to split. The regions for hbtable are shown in the following table:
Table 1. Example split regions Region Split on C1 values 1 Less than 10 2 10 - 15 3 16 - 21 4 22 - 29 5 30 and greater
If C1 contains character or string values, the split might be less efficient with binary encoding. Be sure to match the encoding with the data type of the row key column for more efficient splits. For numeric row key data, specify binary encoding. For character or string row key data, specify string encoding.
- Create an HBase table by using a custom SerDe that enables you to read data that is stored as
JSON objects, and to convert that data to a JSON string that can be understood by HBase. The content
of your HBase table is derived from JSON
CREATE HBASE TABLE post ( id INT, title STRING, content STRING, created_on TIMESTAMP, updated_on TIMESTAMP, published BOOLEAN ) COLUMN MAPPING ( KEY MAPPED BY (id), cf1:cq1 MAPPED BY (title, content, created_on, updated_on, published) ) DEFAULT ENCODING USING SERDE 'com.ibm.biginsights.bigsql.serde.JSONSerDe';An example insert statement is shown in the following code snippet:
INSERT INTO post VALUES (1, 'Hello world!', 'First post!', '2013-01-01 01:01:01.001', '2013-01-01 01:01:01.001', true);
- Create an HBase table with splits on numeric or string
CREATE HBASE TABLE t1 ( c1 INT, c2 VARCHAR(10), c3 VARCHAR(20) ) COLUMN MAPPING ( KEY MAPPED BY (c1, c2), cf1:c1 MAPPED BY (c3) ) SPLITS (1000, 2000, 3000);
CREATE HBASE TABLE t2 ( c1 DATE, c2 VARCHAR(10), c3 VARCHAR(20) ) COLUMN MAPPING ( KEY MAPPED BY (c1, c2), cf1:c1 MAPPED BY (c3) ) SPLITS ('1970-01-01', '2000-01-01', '2010-01-01');
- Create an HBase table with splits on NUMREGION (numeric or
CREATE HBASE TABLE t1 ( c1 INT, c2 VARCHAR(10), c3 VARCHAR(20) ) COLUMN MAPPING ( KEY MAPPED BY (c1, c2), cf1:c1 MAPPED BY (c3) ) NUMREGION (0, 10000, 20);
CREATE HBASE TABLE t1 ( c1 DATE, c2 VARCHAR(10), c3 VARCHAR(20) ) COLUMN MAPPING ( KEY MAPPED BY (c1, c2), cf1:c1 MAPPED BY (c3) ) NUMREGION ('1970-01-01', '2100-01-01', 20);
- Create an HBase table that is based on another HBase
CREATE HBASE TABLE hb1 ( ... ) ... AS SELECT * FROM hbcopy;
- Create an HBase table that uses complex data types such as STRUCT and
CREATE HBASE TABLE complex_types ( key VARCHAR(5), c1 STRUCT<f1:VARCHAR(5), f2:VARCHAR(5)>, c2 ARRAY<VARCHAR(5)> ) COLUMN MAPPING ( KEY MAPPED BY (key), f:q1 MAPPED BY (c1) ENCODING DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '/' LINES TERMINATED BY '\n', f:q2 MAPPED BY (c2) ENCODING DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '/' LINES TERMINATED BY '\n') ;Suppose that the following sample data is stored in the HDFS, in a file named data.csv in folder /tmp:
You can use the LOAD HADOOP statement to populate the table with this data, as shown in the following example:
key1,f1v1/f2v1,a1v1/a2v1/a3v1 key2,f1v2/f2v2,a1v2/a2v2/a3v2 key3,f1v3/f2v3,a1v3/a2v3/a3v3
LOAD HADOOP USING FILE URL '/tmp/data.csv' WITH SOURCE PROPERTIES ('field.delimiter'=',','collection.items.delimiter'='/') INTO TABLE complex_types;After the data is loaded, you can query the complex types as individual elements.
SELECT key, c1.f1, c1.f2, c2, c2, c2 FROM complex_types;