File formats that are supported by Db2 Big SQL

The Hadoop environment supports a large number of file formats. The formats that are described here are available either by using explicit SQL syntax, such as STORED AS ORC, or by using installed interfaces such as Avro. Columnar storage saves both time and space during big data processing. The ORC and Parquet file formats provide excellent performance advantages when used with Db2® Big SQL.

Optimized Row Columnar (ORC)

The ORC file format provides a highly efficient way to store data. ORC files store collections of rows in a columnar format, which enables parallel processing of row collections across your cluster. As of Db2 Big SQL 5.0.2, the ORC file format is recommended for optimal performance and functionality.

The ORC file format uses type-specific encoders for each column and divides the file into large stripes. Each stripe uses indexes that enable the Db2 Big SQL readers to skip sets of rows that do not satisfy filter conditions. A footer contains metadata that includes byte range information for each stripe and type information for the file. The amount of resources that Db2 Big SQL uses to process ORC files is affected by the ORC stripe size. The recommended stripe size for ORC files, which is determined by the orc.stripe.size property, is 64 MB. Stripe sizes larger than 256 MB should be avoided.

The recommended compression type for this file format is zlib (the default).
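A table definition that follows these recommendations might look like the following example. The table and column names are illustrative; 'orc.stripe.size' and 'orc.compress' are standard ORC table properties, with the stripe size given in bytes:

```sql
-- Illustrative ORC table with the recommended 64 MB stripe size
-- and the default zlib compression
CREATE HADOOP TABLE orders_orc (
  order_id INT,
  amount   DOUBLE
)
STORED AS ORC
TBLPROPERTIES (
  'orc.stripe.size' = '67108864',  -- 64 MB, the recommended stripe size
  'orc.compress'    = 'ZLIB'       -- the default and recommended compression
);
```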

Parquet

The Parquet file format is an open source columnar storage format for Hadoop that supports efficient compression and encoding schemes.

During load and insert operations, the following values are set as the default values for the Parquet format. You can change these values by using the SET HADOOP PROPERTY command before you run the LOAD HADOOP statement:
  • SET HADOOP PROPERTY 'dfs.blocksize' = 268435456;
  • SET HADOOP PROPERTY 'parquet.page.size' = 65536;
  • SET HADOOP PROPERTY 'parquet.dictionary.page.size' = 65536;
  • SET HADOOP PROPERTY 'parquet.block.size' = 268435456;
  • SET HADOOP PROPERTY 'parquet.enable.dictionary' = 'true';
  • SET HADOOP PROPERTY 'parquet.compression' = 'SNAPPY';

The recommended compression types for this file format are snappy (the default) and gzip.
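For example, to load data into a Parquet table with gzip rather than the default snappy compression, you might issue statements such as the following. This is a minimal sketch; the table name, column definitions, and file path are illustrative:

```sql
-- Override the default Parquet compression before loading
SET HADOOP PROPERTY 'parquet.compression' = 'GZIP';

CREATE HADOOP TABLE sales_parquet (
  sale_id INT,
  total   DECIMAL(10,2)
)
STORED AS PARQUET;

-- Load delimited source data into the Parquet table
LOAD HADOOP USING FILE URL '/tmp/sales.csv'
  INTO TABLE sales_parquet;
```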

Text

The text file format is the default storage format for a table. The underlying data is stored in delimited form, with one record per line and a newline character terminating each record.

You can specify delimiters by using the ROW FORMAT DELIMITED clause in the CREATE TABLE (HADOOP) statement. For example:

...
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',' ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
  NULL DEFINED AS '\N'
...
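A complete statement that uses these clauses might look like the following example. The table and column names are illustrative:

```sql
-- Illustrative delimited text table: comma-separated fields,
-- backslash as the escape character, \N for null values
CREATE HADOOP TABLE reviews_text (
  review_id INT,
  comment   VARCHAR(500)
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',' ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
  NULL DEFINED AS '\N'
STORED AS TEXTFILE;
```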

Generally, your actual data must not contain delimiter characters. The only exception is that field terminator characters can exist within your data if you specify the ESCAPED BY clause and these characters are properly escaped.

If an incompatible column value is provided (if, for example, you attempt to insert the value 'a' into a column that is defined as INT), that value is treated as a null value. If null values are not allowed, an error is returned.

If your input data contains more columns (fields) than the table, the additional fields in the data are ignored. If there are fewer fields in the input data than columns in the table, null values are inserted if the columns are defined as nullable; otherwise, an error is returned.

Use escape characters to preserve commas that are part of the data.

...
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' ESCAPED BY '\\'
...
For example: Hello\, how are you?

Because the text file format requires type conversion for all non-character columns, it is not as efficient as a binary storage format.

This file format does not support compression.

Avro

Avro, an Apache open source project, provides a convenient way to represent complex data structures within the Hadoop environment. By using an Avro SerDe in your CREATE TABLE (HADOOP) statement, you can read or write Avro data as Db2 Big SQL tables. The following Avro data types are mapped to Db2 Big SQL data types:
Table 1. Avro data types mapped to Db2 Big SQL data types

  Avro data type   Db2 Big SQL data type
  --------------   ---------------------
  BOOLEAN          BOOLEAN
  INT              INT
  LONG             BIGINT
  FLOAT            DOUBLE
  DOUBLE           DOUBLE
  STRING           VARCHAR(max)
  ENUM             VARCHAR(max)

The use of any other data type returns SQL1666N.

The Db2 Big SQL schema can be inferred from the Avro schema. You can specify the Avro schema as parameters to an Avro SerDe.

You can use table properties, such as 'avro.schema.url' or 'avro.schema.literal', to specify the Avro schema, as is shown in the following examples:

...TBLPROPERTIES (
  'avro.schema.url' = 'file:///path/to/the/schema/test_serializer.avsc'
   )
...
;

...TBLPROPERTIES (
  'avro.schema.literal' = 
    '{"namespace": "com.howdy",
      "name": "some_schema",
      "type": "record",
      "fields": [{ "name":"string1","type":"string"}]}'
   )
...
;

There is a Hive limit of 4000 bytes for the 'avro.schema.literal'. If your schema is longer than 4000 bytes, store it in a file and use 'avro.schema.url' to refer to it.
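Putting these pieces together, a complete table definition might look like the following example. The table name and schema are illustrative; because the schema is supplied through 'avro.schema.literal', no column list is specified and the column definitions are inferred from the Avro schema:

```sql
-- Illustrative Avro table; columns are inferred from the schema literal
CREATE HADOOP TABLE avro_example
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
TBLPROPERTIES (
  'avro.schema.literal' =
    '{"namespace": "com.howdy",
      "name": "some_schema",
      "type": "record",
      "fields": [{"name":"string1","type":"string"}]}'
);
```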

Support for Db2 Big SQL and Hive schema evolution with tables in the Avro file format is primarily driven by the SerDe's understanding of the Avro table definition. When you make changes to the schema, it is critical that you keep the column definitions and the Avro schema in synchrony.

Consider a table in the Avro file format that requires a new column. In Db2 Big SQL, the following ALTER TABLE (HADOOP/HBASE) statement adds a new column:

ALTER TABLE...
  ADD COLUMN col3 INT;
Because the Db2 Big SQL column definitions and the Avro schema need to be kept in synchrony to ensure that data is written and read with the new Avro SerDe definition, the table property 'avro.schema.literal' or the HDFS file (containing the JSON schema) that is referenced by 'avro.schema.url' also needs to be updated. For example:

ALTER TABLE...
  SET TBLPROPERTIES (
    'avro.schema.literal' =
      '{\"type\": \"record\",
        \"name\": \"TUPLE_1\",
        \"fields\": [
...
                       {
                         \"name\": \"col3\", \"type\": [\"null\", \"int\"]
                       }
                     ]
       }'
  );
You must specify the full new schema. After this statement runs, you can write data to and select data from the altered table, and Db2 Big SQL handles that data according to the new table layout.

Best practice is to not specify a column list; the correct details are then automatically extracted from the specified Avro schema. If you do specify a column list for the table, ensure that the number of columns matches the number of columns in the Avro schema; otherwise, the CREATE TABLE statement returns an error.

The following compression types are recommended:
  • bzip2
  • deflate
  • gzip
  • snappy

For more information, see the Avro storage format.

Record Columnar (RC)

The RC file format uses binary key/value pairs. It partitions rows horizontally into row splits and then partitions each row split vertically. The metadata pertaining to a row split is the key part, and all of the actual data in the row split is stored as the value part of a record.

The following compression types are recommended:
  • bzip2
  • deflate
  • gzip
  • snappy

Db2 Big SQL uses the default org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe SerDe for the RC file format. When you specify the RC file format in a CREATE TABLE statement, best practice is to use the org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe SerDe, as shown in the following example:

CREATE TABLE my_table (
  i INT,
  s STRING)
  ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
  STORED AS RCFILE;
Note: The ORC file format has many advantages over the RC file format, including better compression, better memory utilization, support for all data types (including DATE and DECIMAL), and significantly improved query performance. As such, the ORC file format is recommended for optimal performance and functionality.

Sequence

The sequence file format is used to hold arbitrary data that might not otherwise be splittable. For example, in a text file, newline characters (\n) are used to determine the boundaries of a record, so a DFS block can be processed simply by looking for newline characters. However, if the data in that file is in binary form or is compressed with an algorithm that does not maintain markers for record boundaries, that block cannot be read independently.

A sequence file maintains additional metadata to recognize record boundaries. There are two types of sequence files:
  • A binary sequence file stores data in a binary format by using the Hive LazyBinarySerDe. With binary storage, the data requires very little conversion processing while being read.
  • A text sequence file stores delimited data within the sequence file format, which enables the use of compression algorithms on textual data that would not otherwise be splittable.

The following compression types are recommended:
  • bzip2
  • deflate
  • gzip
  • snappy
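For example, a text sequence file table with compressed output might be created as follows. The table and column names are illustrative, and the compression settings shown are commonly used Hadoop job properties whose exact names can vary by distribution:

```sql
-- Illustrative: enable gzip output compression (commonly used
-- Hadoop properties; names may vary by distribution)
SET HADOOP PROPERTY 'mapred.output.compress' = 'true';
SET HADOOP PROPERTY 'mapred.output.compression.codec' =
  'org.apache.hadoop.io.compress.GzipCodec';

-- A text sequence file table: delimited data inside the
-- sequence file container, which keeps the data splittable
CREATE HADOOP TABLE events_seq (
  event_id INT,
  payload  VARCHAR(200)
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
```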