File formats supported by Big SQL
The Hadoop environment can read data in a large number of storage formats. This flexibility comes partly from the INPUTFORMAT and OUTPUTFORMAT classes that you can specify on the CREATE TABLE and ALTER TABLE statements, and partly from the use of installed and customized SerDe classes. The file formats listed here are available either by using explicit SQL syntax, such as STORED AS PARQUETFILE, or by using installed interfaces, such as Avro.
Big SQL generally supports anything that Hadoop handles, including compression types, file formats, and SerDes. The compression types listed here are recommendations; the list does not imply that other compression types are unsupported.
For a comparison of the supported file formats, see Big SQL 3.0 file formats: Usage and performance.
Textfile = delimited
The TEXTFILE format is the default storage format for a table. This format indicates that the underlying data is stored in delimited form, one record per line, with a newline character ending each record.
...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
NULL DEFINED AS '\N'
...
Termination characters can be specified in one of the following ways, as illustrated in the sketch after this list:
- \t, a tab character.
- \r, a carriage return character.
- \YYY, where YYY is an octal value representing the desired character.
- \xYY, where YY is a hexadecimal value representing the desired character.
- \uYYYYYYY, where YYYYYYY is the hexadecimal value of a Unicode character representing the desired character.
- Any other character is treated as the literal character value.
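For example, the following sketch creates a tab-delimited text table; the table and column names are hypothetical:
-- Fields are separated by tab characters, records end with newline characters,
-- and the string \N in the data file represents a NULL value.
CREATE TABLE staff_text (
  id int, name string, city string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
  NULL DEFINED AS '\N'
STORED AS TEXTFILE;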
Your data must not contain delimiter characters. The only exception is that the field terminator can appear within your data if you specify the ESCAPED BY clause and the characters are properly escaped. Newline characters cannot be escaped.
If an illegal value is provided for a column, such as inserting the value 'a' into a column that is defined as INT, that value is treated as a NULL value. If NULL values are not allowed for the column, an error is returned.
You can define a table where the input data contains more fields per record than the table has columns; the additional fields in the data are ignored. If a record contains fewer fields than the table has columns, NULL values are returned for the missing columns if they are nullable; otherwise, an error is returned.
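As a small illustration of this behavior (the table name and the data lines are hypothetical):
-- Table with three columns; the data file is comma-delimited.
CREATE TABLE three_cols (
  c1 int, c2 string, c3 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- A data line such as   1,abc,def,extra   is accepted; the fourth field is ignored.
-- A data line such as   2,abc   returns NULL for C3 if C3 is nullable;
-- otherwise, an error is returned.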
For example, consider a table definition that includes the following clauses:
...ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY '\\'
...
The data can then be inserted for column C2 as Hello\, how are you?
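The fragment above is part of a table definition; a fuller sketch, with a hypothetical table name and the column names used above, might look like this:
-- Commas inside a value must be escaped with a backslash in the data file.
CREATE TABLE t_escaped (
  c1 int, c2 string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',' ESCAPED BY '\\'
STORED AS TEXTFILE;
-- A raw record in the data file for this table might look like:
--   1,Hello\, how are you?
-- The escaped comma is read as part of the C2 value rather than as a field terminator.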
TEXTFILE is a useful format because it is widely understood and can be processed with a wide variety of tools. It does, however, require type conversions for all non-character columns, so it is not as efficient as a binary storage format, in which the underlying data requires no conversion.
This type does not support compression.
Sequence file
In Hadoop, a SequenceFile is a file format that is used to hold arbitrary data that might not otherwise be splittable. For example, with a TEXTFILE, newline characters (\n) are used to determine the boundaries of a record, so an arbitrary DFS block can be processed simply by looking for newlines to find the start and end of a given record. However, if the data in that file is in binary form, or is compressed with a compression algorithm that does not maintain markers for record boundaries, reading an arbitrary block is impossible.
- BINARY SEQUENCEFILE
- A BINARY SEQUENCEFILE stores data in a binary format by using the Hive LazyBinarySerDe. Binary storage has the advantage that the data requires very little conversion processing while it is being read, unlike a textual storage format. For best performance, however, use a format such as PARQUETFILE.
- TEXT SEQUENCEFILE
- A TEXT SEQUENCEFILE stores textual, delimited data within a sequence file format. The primary purpose of the format is to allow the use of compression algorithms on textual data that would not otherwise be splittable. Both variations are shown in the sketch that follows.
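A minimal sketch of both variations; the table names are hypothetical, and the exact STORED AS clauses are assumed to be available in your Big SQL level:
-- Binary sequence file: data is stored by using the Hive LazyBinarySerDe.
CREATE TABLE events_binseq (
  id int, payload string
)
STORED AS BINARY SEQUENCEFILE;

-- Text sequence file: delimited text wrapped in a sequence file, which lets you
-- apply compression that would otherwise make the text data non-splittable.
CREATE TABLE events_textseq (
  id int, payload string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXT SEQUENCEFILE;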
The following compression types are recommended with this format:
- snappy
- gzip
- deflate
- bzip2
RC file
Record Columnar (RC) files are flat files that consist of binary key/value pairs. This format stores the columns of a table in a record columnar way: it first partitions rows horizontally into row splits, and then it vertically partitions each row split in a columnar way. The RC file format stores the metadata of a row split as the key part of a record, and all the data of the row split as the value part.
The RC file format is an efficient, high-performing storage format.
The following compression types are recommended with this format:
- snappy
- gzip
- deflate
- bzip2
The following example creates a table that is stored in the RC file format:
CREATE TABLE my_table (
  i int, s string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE;
PARQUET file or PARQUETFILE
Parquet is an open-source columnar storage format for Hadoop. It supports efficient compression and encoding schemes.
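For example, a minimal sketch of a table stored in Parquet format (the table name is hypothetical):
-- Store the table data in Parquet format.
CREATE TABLE sales_parquet (
  id int, amount double, region string
)
STORED AS PARQUETFILE;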
If you are loading data into a Parquet table, you can improve performance by setting the following properties with the SET HADOOP PROPERTY statement before you issue the LOAD statement:
- SET HADOOP PROPERTY 'dfs.blocksize' = 268435456; // 256 MB
- SET HADOOP PROPERTY 'parquet.page.size' = 65536; // 64 KB
- SET HADOOP PROPERTY 'parquet.dictionary.page.size' = 65536; // 64 KB
- SET HADOOP PROPERTY 'parquet.block.size' = 268435456; // 256 MB
- SET HADOOP PROPERTY 'parquet.enable.dictionary' = 'true';
- SET HADOOP PROPERTY 'parquet.compression' = 'SNAPPY';
The following compression types are recommended with this format:
- snappy (the default)
- gzip
You might also need to adjust the following MapReduce memory-related properties, as shown in the sketch after this list:
- mapreduce.map.java.opts
- mapreduce.map.memory.mb
- mapreduce.reduce.java.opts
- mapreduce.reduce.memory.mb
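The following sketch shows how these properties can be adjusted by using the same SET HADOOP PROPERTY statement; the values are illustrative assumptions, not recommendations:
-- Illustrative values only; tune them for your workload and cluster.
SET HADOOP PROPERTY 'mapreduce.map.memory.mb' = 4096;
SET HADOOP PROPERTY 'mapreduce.map.java.opts' = '-Xmx3500m';
SET HADOOP PROPERTY 'mapreduce.reduce.memory.mb' = 4096;
SET HADOOP PROPERTY 'mapreduce.reduce.java.opts' = '-Xmx3500m';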
Avro file
Avro, an Apache open source project, provides a convenient way to represent complex data structures within a Hadoop environment. The following table shows how Avro data types map to Big SQL data types.
Avro data type | Big SQL data type
---|---
BOOLEAN | SMALLINT
INT | INT
LONG | BIGINT
FLOAT | DOUBLE
DOUBLE | DOUBLE
STRING | VARCHAR(max)
ENUM | VARCHAR(max)
The Big SQL schema can be inferred from the Avro schema. You can specify the Avro schema as a table property that is passed to the Avro SerDe, in either of the following ways; a complete table definition sketch follows the list:
- avro.schema.url
  ...TBLPROPERTIES ( 'avro.schema.url'='file:///path/to/the/schema/test_serializer.avsc' ) ... ;
- avro.schema.literal
  ...TBLPROPERTIES ( 'avro.schema.literal' = '{"namespace": "com.howdy", "name": "some_schema", "type": "record", "fields": [{ "name":"string1","type":"string"}]}' ) ... ;
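Putting these pieces together, the following sketch defines an Avro-backed table by using the Hive Avro SerDe and container input/output format classes; the table name is hypothetical, and the schema is the sample literal shown above:
-- The Big SQL columns are inferred from the Avro schema in avro.schema.literal.
CREATE TABLE avro_sample
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ( 'avro.schema.literal' = '{"namespace": "com.howdy", "name": "some_schema", "type": "record", "fields": [{ "name":"string1","type":"string"}]}' );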
The following compression types are recommended with this format:
- snappy
- gzip
- deflate
- bzip2
For more information about using the Avro format, see How to succeed using Avro storage format.
ORC file
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data. The ORC file format stores collections of rows in one file, and within each collection the row data is stored in a columnar format. This arrangement allows parallel processing of row collections across a cluster.
The ORC file format uses type-specific encoders for each column and divides the file into large stripes. Each stripe includes indexes that allow the Big SQL readers to skip sets of rows that do not satisfy a filter condition. The file also contains a footer with metadata that includes byte-range information for each stripe and type information for the file.
The following compression type is recommended with this format:
- zlib (the default)
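A minimal sketch of a table stored in the ORC file format; the table name is hypothetical, and the STORED AS ORC clause is assumed to be available in your Big SQL level:
-- Store the table in ORC format.
CREATE TABLE events_orc (
  id int, event_time string, payload string
)
STORED AS ORC;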