Data ingestion techniques

You can use various methods to ingest data into Db2® Big SQL. The method that you use, the size of the data files, and the file format can all have an impact on query performance.

The following table summarizes the uses, strengths, and limitations of different approaches to ingesting data into Db2 Big SQL.

Table 1. Data ingestion methods
Each method is listed with its uses, strengths, and limitations.

Adding files to HDFS
  Uses:
  • Provides access to data from an external location
  • Lets you work with data in HDFS
  Strengths:
  • Fastest method
  • Gives you the flexibility to use any ingestion technique that is available for Hadoop
  • You can use the distcp tool to quickly copy or move files on HDFS (see DistCp Guide)
Syncing the Db2 Big SQL and Hive catalogs
  Uses:
  • Lets you work with data from Hive or another third-party tool that synchronizes with the Hive metastore (see the sketch after this table)
  Strengths:
  • Data that is already in Hive does not have to be added to Db2 Big SQL
  Limitations:
  • String data types can lead to wide tables in Db2 Big SQL (see STRING)
Using the INSERT...SELECT statement
  Uses:
  • Enables data transformations; for example, you can query a table with text-format data and insert the result into a table with a more compact file format, such as ORC or Parquet (see the sketch after this table)
  Limitations:
  • Not recommended for inserting large quantities of data in bulk
  • Not recommended for very wide tables
Using the LOAD HADOOP statement
  Uses:
  • An all-purpose load utility for populating Db2 Big SQL tables (see the sketch after this table)
  Strengths:
  • Faster than INSERT...SELECT when ingesting large quantities of data
  • Can load data from external sources such as JDBC, SFTP, HDFS, and S3
  • Based on MapReduce, so YARN manages the resources
  • Data is validated as it is added, and bad records are rejected
  • Some data transformation can be done to conform to target data types
  Limitations:
  • Loading tables that are partitioned by a column expression is limited to JDBC source queries
  • Not as many transformations are available as with INSERT...SELECT
CREATE TABLE...AS
  Uses:
  • Uses the INSERT...SELECT statement under the covers to create a table that is similar to another table but with, for example, a different file format (see the sketch after this table)
  • Enables data transformations
  Strengths:
  • Gives you the flexibility to use remote data sources and nicknames (federated data sources)
  Limitations:
  • Not recommended for ingesting large quantities of data in bulk
  • Not recommended for very wide tables
Sqoop
  Uses:
  • The most prominent tool for moving relational data into and out of a Hadoop system
  Strengths:
  • Many JDBC connectors are provided
  • Faster than INSERT...SELECT
  • Some data validation is done to prevent bad data from being added to a table
  • You can import compressed tables into Hive
  Limitations:
  • Limited to loading data from a JDBC source
  • Does not support dynamic partitioning
  • Does not transform data
Important: Avoid using any data ingestion technique that results in many small files, because this can have a negative impact on query performance.
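Examples:

The catalog-sync method is typically driven by the SYSHADOOP.HCAT_SYNC_OBJECTS stored procedure, which imports table definitions from the Hive metastore into the Db2 Big SQL catalog. The following is a minimal sketch; the schema and table names are placeholders, and the available arguments can vary by release.

  -- Import the Hive metastore definition of one table into the Db2 Big SQL
  -- catalog. 'myschema' and 'mytable' are placeholder names; 'a' requests all
  -- object types, REPLACE overwrites an existing definition, and CONTINUE
  -- skips objects that are in error.
  CALL SYSHADOOP.HCAT_SYNC_OBJECTS('myschema', 'mytable', 'a', 'REPLACE', 'CONTINUE');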
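For the INSERT...SELECT method, the following sketch copies text-format data into a Parquet table while applying a simple transformation. All table and column names are hypothetical.

  -- Hypothetical target table with a compact file format.
  CREATE HADOOP TABLE sales_parquet (
    order_id   INT,
    amount     DECIMAL(10,2),
    order_date DATE
  )
  STORED AS PARQUET;

  -- Copy the data, transforming as needed; here, rows with no amount are
  -- filtered out.
  INSERT INTO sales_parquet
    SELECT order_id, amount, order_date
    FROM sales_text
    WHERE amount IS NOT NULL;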
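For the LOAD HADOOP method, the statement can read from a file URL or from a JDBC source. The following is a minimal sketch; the paths, connection URL, credentials, and table names are placeholders, and option names can vary by release.

  -- Load a delimited file that is already in HDFS into an existing table.
  LOAD HADOOP USING FILE URL '/user/bigsql/staging/sales.csv'
    WITH SOURCE PROPERTIES ('field.delimiter' = ',')
    INTO TABLE sales_text APPEND;

  -- Load from a remote database over JDBC.
  LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:db2://dbhost:50000/SAMPLE'
    WITH PARAMETERS ('user' = 'myuser', 'password' = 'mypassword')
    FROM TABLE GOSALES.PRODUCT
    INTO TABLE product_hadoop APPEND;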
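For the CREATE TABLE...AS method, a single statement can create and populate a table with a different file format. The following sketch reuses the hypothetical sales_text table from the examples above; the exact clause order can vary by release.

  -- Create and populate an ORC-format copy of sales_text in one statement.
  CREATE HADOOP TABLE sales_orc
    STORED AS ORC
    AS SELECT order_id, amount, order_date FROM sales_text;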
Best practices:

Overview of data ingestion techniques
An overview of the various ingestion techniques, with best practices concerning the number of files, file sizes, and the placement of files.

Specific ingestion techniques
Descriptions of specific ingestion techniques and their best practices.

Additional documentation
Related Knowledge Center topics.