Supported file formats for Datalake tables
Many file formats are supported in open source; the file formats described here are the ones that are supported in Db2® Warehouse on Cloud and Db2 Warehouse on OpenShift. Columnar storage saves both time and space during data processing, making ORC and Parquet the preferred file formats to use.
Parquet
The Parquet file format is an open source columnar storage format. Parquet is highly efficient for many types of large-scale queries, and it is especially well suited to queries that scan particular columns within a table.
This is the default file format supported for Datalake tables in Db2.
The recommended compression types for this file format are snappy (the default) and gzip.
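For illustration, a minimal sketch of a Parquet-backed Datalake table follows. It assumes a CREATE DATALAKE TABLE statement with Hive-style STORED AS and TBLPROPERTIES clauses; the table name, columns, and the parquet.compression property key are illustrative assumptions, not taken from this documentation.
-- Minimal sketch (assumptions noted above): a Parquet table with snappy compression
CREATE DATALAKE TABLE sales_parquet (
  order_id   BIGINT,
  amount     DOUBLE,
  order_date DATE
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY');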
Optimized Row Columnar (ORC)
The ORC file format provides a highly efficient way to store data. ORC files store collections of rows in a columnar format, which enables parallel processing of row collections across your cluster.
The ORC file format uses type-specific encoders for each column and divides the file into large stripes. Each stripe uses indexes that enable the Db2 readers to skip sets of rows that do not satisfy filter conditions. A footer contains metadata that includes byte range information for each stripe and type information for the file. The amount of resources that Db2 uses to process ORC files is affected by the ORC stripe size. The recommended stripe size for ORC files, which is determined by the orc.stripe.size property, is 64 MB. Stripe sizes larger than 256 MB should be avoided.
For details about using the ORC file format with historical DATE and TIMESTAMP values, see Data types supported by Datalake tables.
The recommended compression type for this file format is zlib (the default).
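As a hedged sketch, the following statement creates an ORC table and sets the recommended 64 MB stripe size through the orc.stripe.size property (expressed in bytes). The table name, columns, and the orc.compress property key are illustrative assumptions.
-- Minimal sketch: an ORC table with the recommended 64 MB stripe size
CREATE DATALAKE TABLE events_orc (
  event_id BIGINT,
  payload  VARCHAR(200),
  event_ts TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  'orc.stripe.size'='67108864',  -- 64 MB (recommended); avoid stripe sizes larger than 256 MB
  'orc.compress'='ZLIB'          -- zlib, the default compression type (assumed property key)
);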
Avro
Avro, an Apache open source project, provides a convenient way to represent complex data structures within the Hadoop environment. By using an Avro SerDe in your CREATE TABLE (Datalake) statement, you can read or write Avro data as Datalake tables.
Avro data type | Db2 data type | avro.schema.literal |
---|---|---|
LONG | BIGINT | { "name": "<COL_NAME>", "type": "long" }, |
BYTES | BINARY, VARBINARY | { "name": "<COL_NAME>", "type": "bytes" }, |
BOOLEAN | BOOLEAN | { "name": "<COL_NAME>", "type": "boolean" }, |
STRING | VARCHAR, CHAR, STRING | { "name": "<COL_NAME>", "type": "string" }, |
DOUBLE | DOUBLE, FLOAT8 | { "name": "<COL_NAME>", "type": "double" }, |
FLOAT | FLOAT, FLOAT4, REAL | { "name": "<COL_NAME>", "type": "float" }, |
INT | INTEGER, INT, INT4, SMALLINT, INT2, TINYINT | { "name": "<COL_NAME>", "type": "int" }, |
Text
The text file format is common because it is easy to use and manage. Data is stored in lines, with each line being a record. Each line is terminated by a newline character (\n).
By default, the field terminator for the text file format is \001, the ASCII character that represents CTRL-A. If you want to use CSV files, the EXTERNAL TABLE support is recommended over the Datalake TABLE support. See the CREATE EXTERNAL TABLE statement.
Although this file format is the default in Hive, this is not the case for Db2 Datalake tables, where the default is Parquet.
If an incompatible column value is provided (if, for example, you attempt to insert the value 'a' into a column that is defined as INT), an error is returned.
If the data stored in the file for the table contains an incompatible column value, that value is treated as a null value. If null values are not allowed, an error is returned.
If your input data contains more columns (fields) than the table, the additional fields in the data are ignored. If there are fewer fields in the input data than columns in the table, null values are inserted if the columns are defined as nullable; otherwise, an error is returned.
Generally, your actual data must not contain delimiter characters. The only exception is that field terminator characters can exist within your data if you specify the ESCAPED BY clause and these characters are properly escaped. Newline characters cannot be escaped.
...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY '\\'
...
This file format does not support compression.
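For illustration, a complete sketch of a delimited text table follows, expanding on the fragment above. The table name and columns are illustrative assumptions.
-- Minimal sketch: a text-format table with an explicit field terminator and escape character
CREATE DATALAKE TABLE staging_text (
  id    INT,
  descr VARCHAR(100)
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',' ESCAPED BY '\\'
STORED AS TEXTFILE;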
JSON
The JSON file format is a popular open format for data exchange, web application development, and application logs. You can query data in valid JSON Lines format by specifying STORED AS JSONFILE. Each row must be represented as a JSON fragment, and lines must be separated by a newline character.
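As a minimal sketch, assuming the same Hive-style DDL as in the earlier examples, the following table reads JSON Lines data. The table name, columns, and sample record are illustrative assumptions.
-- Each line of the underlying data files is one JSON object, for example:
--   {"id": 1, "name": "widget", "price": 9.99}
CREATE DATALAKE TABLE products_json (
  id    INT,
  name  VARCHAR(50),
  price DOUBLE
)
STORED AS JSONFILE;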