Data ingestion techniques

You can use various methods to ingest data into Db2® Big SQL. The method that you use, the size of the data files, and the file format can all have an impact on query performance.

The following table summarizes the uses, strengths, and limitations of different approaches to ingesting data into Db2 Big SQL.

Table 1. Data ingestion methods
Each method is listed with its uses, strengths, and limitations.

Adding files to HDFS
  Uses:
  • Provides access to data from an external location
  • Lets you work with data in HDFS
  Strengths:
  • Fastest method
  • Gives you the flexibility to use any ingestion technique that is available for Hadoop
  • You can use the distcp tool to quickly copy or move files on HDFS (see DistCp Guide)
Syncing the Db2 Big SQL and Hive catalogs
  Uses:
  • Lets you work with data from Hive or another third-party tool that synchronizes with the Hive metastore (see the sketch after this table)
  Strengths:
  • Data that is already in Hive does not have to be added to Db2 Big SQL
  Limitations:
  • String data types can lead to wide tables in Db2 Big SQL (see STRING)
Using the INSERT...SELECT statement
  Uses:
  • Enables data transformations; for example, you can query a table with text-format data and insert the result into a table with a more compact file format, such as ORC or Parquet (see the sketch after this table)
  Limitations:
  • Not recommended for inserting large quantities of data in bulk
  • Not recommended for very wide tables
Using the LOAD HADOOP statement
  Uses:
  • An all-purpose load utility for populating Db2 Big SQL tables (see the sketch after this table)
  Strengths:
  • Faster than INSERT...SELECT when ingesting large quantities of data
  • Can load data from external sources such as JDBC, SFTP, HDFS, and S3
  • Based on MapReduce, so YARN manages the resources
  • Data is validated as it is added, and bad records are rejected
  • Some data transformation can be done to conform to target data types
  Limitations:
  • Loading tables that are partitioned by a column expression is limited to JDBC source queries
  • Not as many transformations are available as with INSERT...SELECT
CREATE TABLE...AS
  Uses:
  • Uses the INSERT...SELECT statement under the covers to create a table that is similar to another table but with, for example, a different file format (see the sketch after this table)
  • Enables data transformations
  Strengths:
  • Gives you the flexibility to use remote data sources and nicknames (federated data sources)
  Limitations:
  • Not recommended for ingesting large quantities of data in bulk
  • Not recommended for very wide tables
Sqoop
  Uses:
  • The most prominent tool for moving relational data into and out of a Hadoop system
  Strengths:
  • Many JDBC connectors are provided
  • Faster than INSERT...SELECT
  • Some data validation is done to prevent bad data from being added to a table
  • You can import compressed tables into Hive
  Limitations:
  • Limited to loading data from a JDBC source
  • Does not support dynamic partitioning
  • Does not transform data
Important: Avoid using any data ingestion technique that results in many small files, because this can have a negative impact on query performance.
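Examples:

The catalog-sync method is typically driven by the SYSHADOOP.HCAT_SYNC_OBJECTS stored procedure, which imports table definitions from the Hive metastore into the Db2 Big SQL catalog. The following is a minimal sketch; the schema and table names are placeholders, and the available arguments can vary by release.

  -- Import the Hive metastore definition of one table into the Db2 Big SQL
  -- catalog. 'myschema' and 'mytable' are placeholder names; 'a' requests all
  -- object types, REPLACE overwrites an existing definition, and CONTINUE
  -- skips objects that are in error.
  CALL SYSHADOOP.HCAT_SYNC_OBJECTS('myschema', 'mytable', 'a', 'REPLACE', 'CONTINUE');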
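For the INSERT...SELECT method, the following sketch copies text-format data into a Parquet table while applying a simple transformation. All table and column names are hypothetical.

  -- Hypothetical target table with a compact file format.
  CREATE HADOOP TABLE sales_parquet (
    order_id   INT,
    amount     DECIMAL(10,2),
    order_date DATE
  )
  STORED AS PARQUET;

  -- Copy the data, transforming as needed; here, rows with no amount are
  -- filtered out.
  INSERT INTO sales_parquet
    SELECT order_id, amount, order_date
    FROM sales_text
    WHERE amount IS NOT NULL;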
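For the LOAD HADOOP method, the statement can read from a file URL or from a JDBC source. The following is a minimal sketch; the paths, connection URL, credentials, and table names are placeholders, and option names can vary by release.

  -- Load a delimited file that is already in HDFS into an existing table.
  LOAD HADOOP USING FILE URL '/user/bigsql/staging/sales.csv'
    WITH SOURCE PROPERTIES ('field.delimiter' = ',')
    INTO TABLE sales_text APPEND;

  -- Load from a remote database over JDBC.
  LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:db2://dbhost:50000/SAMPLE'
    WITH PARAMETERS ('user' = 'myuser', 'password' = 'mypassword')
    FROM TABLE GOSALES.PRODUCT
    INTO TABLE product_hadoop APPEND;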
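For the CREATE TABLE...AS method, a single statement can create and populate a table with a different file format. The following sketch reuses the hypothetical sales_text table from the examples above; the exact clause order can vary by release.

  -- Create and populate an ORC-format copy of sales_text in one statement.
  CREATE HADOOP TABLE sales_orc
    STORED AS ORC
    AS SELECT order_id, amount, order_date FROM sales_text;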
Best practices:

Overview of data ingestion techniques
An overview of the various ingestion techniques, with best practices concerning the number of files, file sizes, and the placement of files.

Specific ingestion techniques
Descriptions of specific ingestion techniques and their best practices.

Additional documentation
Related Knowledge Center topics.