File formats supported by Big SQL

The Hadoop environment can read a large number of storage formats. This flexibility comes partly from the INPUTFORMAT and OUTPUTFORMAT classes that you can specify on the CREATE TABLE and ALTER TABLE statements, and partly from the use of installed and customized SerDe classes. The file formats listed here are available either by using explicit SQL syntax, such as STORED AS PARQUETFILE, or by using installed interfaces, such as Avro.

Big SQL generally supports anything that Hadoop handles, including compression types, file formats, and SerDes. The compression types that are listed are recommendations; other compression types are not excluded.

For a comparison of the supported file formats, see Big SQL 3.0 file formats: Usage and performance.

Textfile = delimited

The TEXTFILE format is the default storage format for a table. This format indicates that the underlying data is stored in delimited form, one record per line, with newline characters separating individual records.

Delimiters can be provided by using the ROW FORMAT DELIMITED clause in the CREATE TABLE (HADOOP) statement.

...
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',' ESCAPED BY '\\' 
  LINES TERMINATED BY '\n'  
  NULL DEFINED AS '\N'
...
Termination characters can be specified in one of the following ways:
  • \t, a tab character.
  • \r, a carriage return character.
  • \YYY, where YYY is an octal value representing the desired character.
  • \xYY, where YY is a hexadecimal value representing the desired character.
  • \uYYYYYYY, where YYYYYYY is the hexadecimal value of a Unicode character representing the desired character.
  • Any other character is treated as the literal character value.
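For example, the following is a minimal sketch (the table and column names are hypothetical) that uses an octal escape to specify a non-printable field terminator:

CREATE HADOOP TABLE t1 (
  c1 INT,
  c2 VARCHAR(20)
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'  -- octal 001, the Ctrl-A control character
  STORED AS TEXTFILE;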

Your data must not contain the delimiter characters. The only exception is the field terminator, which can appear within your data if you specify the ESCAPED BY clause and the characters are properly escaped. Newlines cannot be escaped.

If an illegal value is provided for a column, such as inserting the value 'a' into a column defined as INT, that value is treated as a NULL value. If NULLs are not allowed, an error is returned.

You can define a table where the input data file contains more fields than the table has columns. The additional fields in the data are ignored. If the input data contains fewer fields than the table has columns, NULL is returned for the missing columns if they are declared as nullable; otherwise, an error is returned.

You cannot use quoted values, such as "Hello, how are you", where quotation marks are intended to preserve the comma in the data. In TEXTFILE format you must, instead, use an escape character to protect nested delimiters. For example, assume that column C2 is defined as VARCHAR(100); you can then define the table with the following clause:

...ROW FORMAT DELIMITED 
 FIELDS TERMINATED BY ',' ESCAPED BY '\\'
...
The data can then be inserted for column C2 as Hello\, how are you?.

TEXTFILE is a useful format because it is commonly understood and can be processed with a wide variety of tools. It does, however, require type conversions for all non-character columns, so the format is not as efficient as a binary storage format in which the underlying storage format requires no conversion.

This type does not support compression.
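To show how these clauses fit together, the following is a minimal end-to-end sketch of a delimited table definition (the table and column names are hypothetical; the ROW FORMAT clause reuses the fragment shown earlier):

CREATE HADOOP TABLE messages (
  c1 INT,
  c2 VARCHAR(100)
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',' ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
  NULL DEFINED AS '\N'
  STORED AS TEXTFILE;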

Sequence file

In Hadoop a SequenceFile is a file format that is used to hold arbitrary data that might not otherwise be splittable. For example, with a TEXTFILE, newlines (\n) are used to determine the boundaries of a record, so an arbitrary DFS block can be processed simply by looking for newlines to determine the start and end of a given record. However, if the data in that file is in binary form or is compressed with a compression algorithm that does not maintain markers for a record boundary, then reading an arbitrary block is impossible.

With a SequenceFile, you can store any arbitrary data. The SequenceFile maintains additional metadata to recognize record boundaries and, therefore, any data can be stored in the file and still maintain the ability to be split on record boundaries. There are two types of data storage within a SEQUENCEFILE: BINARY and TEXT data.
BINARY SEQUENCEFILE
A BINARY SEQUENCEFILE stores data in a binary format by using the Hive LazyBinarySerDe. Binary storage has the advantage that the data is stored in a format that requires very little conversion processing while being read, unlike textual storage formats. For best performance, use a format such as PARQUETFILE.
TEXT SEQUENCEFILE
A TEXT SEQUENCEFILE stores textual, delimited data within a sequence file format. The primary purpose of the format is to allow for the use of compression algorithms on the textual data that would not otherwise be splittable.
The following compression types are recommended:
  • snappy
  • gzip
  • deflate
  • bzip2
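For example, the following is a minimal sketch (the table and column names are hypothetical) of a delimited table stored as a TEXT SEQUENCEFILE; compression of the written data is typically enabled through the Hadoop output-compression properties of your distribution:

CREATE HADOOP TABLE events_seq (
  event_id INT,
  event_text VARCHAR(200)
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS SEQUENCEFILE;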

RC file

Record Columnar (RC) files are flat files that consist of binary key/value pairs. This format stores the columns of a table in a record-columnar way: it first partitions rows horizontally into row splits, and then it vertically partitions each row split in a columnar way. The RC file format stores the metadata of a row split as the key part of a record, and all the data of that row split as the value part.

The RC file format is an efficient, high-performing format.

The following compression types are recommended:
  • snappy
  • gzip
  • deflate
  • bzip2
Big SQL uses the following default SerDe for the RC file format: org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe. For best performance, when you use the RC file format in a CREATE TABLE statement, specify the org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe SerDe, as in the following example:

CREATE TABLE my_table (
  i int, s string
  )
  ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
  STORED AS RCFile;

PARQUET file or PARQUETFILE

Parquet is an open-source columnar storage format for Hadoop. It supports efficient compression and encoding schemes.

In LOAD HADOOP and INSERT operations, the following values are set as the defaults for the Parquet format. You can change these values by using the SET HADOOP PROPERTY statement before you issue the LOAD statement:
  • SET HADOOP PROPERTY 'dfs.blocksize' = 268435456; // 256MB
  • SET HADOOP PROPERTY 'parquet.page.size' = 65536; // 64K
  • SET HADOOP PROPERTY 'parquet.dictionary.page.size' = 65536; // 64K
  • SET HADOOP PROPERTY 'parquet.block.size' = 268435456; // 256MB
  • SET HADOOP PROPERTY 'parquet.enable.dictionary' = 'true';
  • SET HADOOP PROPERTY 'parquet.compression' = 'SNAPPY';
The following compression types are recommended:
  • snappy (the default)
  • gzip
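For example, the following is a minimal sketch (the table and column names are hypothetical) that creates a Parquet table and changes the compression codec from the default snappy to gzip for a subsequent load:

CREATE HADOOP TABLE sales_parquet (
  id INT,
  amount DOUBLE
  )
  STORED AS PARQUETFILE;

-- Change the compression codec for a subsequent LOAD HADOOP or INSERT statement.
SET HADOOP PROPERTY 'parquet.compression' = 'GZIP';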
In an environment with YARN, determine how much memory a node can make available for allocating resources by using the yarn.nodemanager.resource.memory-mb parameter. Define the yarn.scheduler.minimum-allocation-mb and yarn.scheduler.maximum-allocation-mb parameters with values that make sense for your environment. If the MapReduce job does not explicitly define how much memory to use for each map task or reduce task, it defaults to the value of yarn.scheduler.minimum-allocation-mb. Then, examine the default values for these parameters:
  • mapreduce.map.java.opts
  • mapreduce.map.memory.mb
  • mapreduce.reduce.java.opts
  • mapreduce.reduce.memory.mb
The MapReduce parameters refer to virtual memory, and their effect depends on the yarn.nodemanager.vmem-pmem-ratio setting. The minimum-allocation-mb value is based on physical memory. If the ratio is too high, the system can run out of memory. You might want to reduce the value of mapreduce.reduce.memory.mb so that it is more closely aligned with the value of yarn.scheduler.minimum-allocation-mb and a query can complete.
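For example, if a LOAD HADOOP job fails because of memory constraints, you can override the MapReduce memory settings for the session before you issue the LOAD statement. The following is a hedged sketch; the values shown are placeholders, not recommendations, and must be tuned for your cluster:

-- Align task memory with yarn.scheduler.minimum-allocation-mb; the values are examples only.
SET HADOOP PROPERTY 'mapreduce.map.memory.mb' = 2048;
SET HADOOP PROPERTY 'mapreduce.reduce.memory.mb' = 2048;
-- Keep the JVM heap below the container size; again, example values only.
SET HADOOP PROPERTY 'mapreduce.map.java.opts' = '-Xmx1638m';
SET HADOOP PROPERTY 'mapreduce.reduce.java.opts' = '-Xmx1638m';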

Avro file

Avro, an Apache open source project, provides a convenient way to represent complex data structures within a Hadoop environment.

By using an Avro SerDe in your CREATE HADOOP TABLE statement, you can read or write Avro data as Big SQL tables. However, if you create an Avro table and do not specify an explicit set of columns and data types, the Avro data types are mapped to Big SQL data types as follows:
Table 1. Avro data types mapped to Big SQL
  Avro data type    Big SQL data type
  BOOLEAN           SMALLINT
  INT               INT
  LONG              BIGINT
  FLOAT             DOUBLE
  DOUBLE            DOUBLE
  STRING            VARCHAR(max)
  ENUM              VARCHAR(max)
All other data types result in SQL1666N.

The Big SQL schema can be inferred from the Avro schema. You can specify the Avro schema as parameters to an Avro SerDe.

You can specify the Avro schema by using table properties, such as avro.schema.literal or avro.schema.url, as in the following examples:
avro.schema.url

...TBLPROPERTIES (
  'avro.schema.url'='file:///path/to/the/schema/test_serializer.avsc'
   )
...
;
avro.schema.literal

...TBLPROPERTIES (
  'avro.schema.literal' = 
    '{"namespace": "com.howdy",
      "name": "some_schema",
      "type": "record",
      "fields": [{ "name":"string1","type":"string"}]}'
   )
...
;
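Putting these pieces together, the following is a minimal sketch of a complete table definition that uses the Avro SerDe with an inline schema. The table and column names are hypothetical, and the SerDe, INPUTFORMAT, and OUTPUTFORMAT class names are the commonly used Hive Avro classes; verify them against your installation:

CREATE HADOOP TABLE avro_example (
  string1 VARCHAR(100)
  )
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    -- The inline schema reuses the avro.schema.literal example shown above.
    'avro.schema.literal' =
      '{"namespace": "com.howdy",
        "name": "some_schema",
        "type": "record",
        "fields": [{ "name":"string1","type":"string"}]}'
  );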
The following compression types are recommended:
  • snappy
  • gzip
  • deflate
  • bzip2

For more information about using the Avro format, see How to succeed using Avro storage format.

ORC file

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data. The ORC file format stores collections of rows in one file and within the collection the row data is stored in a columnar format. This allows parallel processing of row collections across a cluster.

The ORC file format uses type-specific encoders for each column and divides the file into large stripes. Each stripe uses indexes that allow the Big SQL readers to skip sets of rows that do not satisfy the filter condition. The file also contains a footer with metadata that includes byte-range information for each stripe and type information for the file.

The following compression types are recommended:
  • zlib (the default)
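For example, the following is a minimal sketch (the table and column names are hypothetical) of an ORC table; the orc.compress table property, which selects the compression codec, is the standard Hive ORC property and should be verified against your installation:

CREATE HADOOP TABLE inventory_orc (
  id INT,
  qty INT
  )
  STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'ZLIB');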