**Adding files to HDFS**

Provides access to data from an external location and lets you work with data in HDFS.

Advantages:
- Fastest method
- Gives you the flexibility to use any ingestion technique that is available for Hadoop
- You can use the distcp tool to quickly copy or move files on HDFS (see DistCp Guide)
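
One way to work with files after they have been added to HDFS is to define a Db2 Big SQL table over the directory that holds them. The sketch below is a minimal, hypothetical example: the table name, columns, delimiter, and the /user/bigsql/weblogs path are all assumptions for illustration.

```sql
-- Hypothetical sketch: expose comma-delimited files that were copied to HDFS
-- (for example with hdfs dfs -put or distcp) as a Db2 Big SQL table.
CREATE EXTERNAL HADOOP TABLE weblogs (
    log_ts   TIMESTAMP,
    user_id  VARCHAR(32),
    url      VARCHAR(1024)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/bigsql/weblogs';
```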
**Syncing the Db2 Big SQL and Hive catalogs**

Lets you work with data from Hive or another third-party tool that synchronizes with the Hive metastore.

Advantages:
- Data that is already in Hive does not have to be added to Db2 Big SQL

Restrictions:
- String data types can lead to wide tables in Db2 Big SQL (see STRING)
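
A rough sketch of the catalog-sync approach, using the SYSHADOOP.HCAT_SYNC_OBJECTS stored procedure that Db2 Big SQL provides for importing Hive metastore definitions. The schema name below is hypothetical, and the argument values shown ('a', 'REPLACE', 'CONTINUE') are one common combination; the exact options accepted can vary by release.

```sql
-- Hypothetical sketch: make all objects in the Hive schema SALES_HIST visible
-- to Db2 Big SQL ('a' = all object types, replace any existing definitions,
-- continue past objects that fail to import).
CALL SYSHADOOP.HCAT_SYNC_OBJECTS('SALES_HIST', '.*', 'a', 'REPLACE', 'CONTINUE');
```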
**Using the INSERT...SELECT statement**

Enables data transformations; for example, querying a table with text-format data and inserting that data into a table with a more compact file format, such as ORC or Parquet.

Restrictions:
- Not recommended for inserting large quantities of data in bulk
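
A minimal sketch of the text-to-Parquet pattern just described. The table names, columns, the transformation applied, and the STORED AS PARQUET clause are assumptions for illustration.

```sql
-- Hypothetical sketch: copy rows from a text-format table into a more compact
-- Parquet table, applying a small transformation along the way.
CREATE HADOOP TABLE sales_parquet (
    sale_date  DATE,
    region     VARCHAR(20),
    amount     DECIMAL(12,2)
)
STORED AS PARQUET;

INSERT INTO sales_parquet
SELECT sale_date, UPPER(region), amount
FROM sales_text
WHERE amount IS NOT NULL;
```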
**Using the LOAD HADOOP statement**

An all-purpose load utility for populating Db2 Big SQL tables.

Advantages:
- Faster than INSERT...SELECT when ingesting large quantities of data
- Can load data from external sources such as JDBC, SFTP, HDFS, and S3
- This approach is based on MapReduce, so YARN manages the resources
- Data is validated as it is added, and bad records are rejected
- Some data transformation can be done to conform to target data types

Restrictions:
- Loading tables that are partitioned by a column expression is limited to JDBC source queries
- Not as many transformations are available as with INSERT...SELECT
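
A hedged sketch of LOAD HADOOP reading a delimited file from HDFS and appending it to an existing table. The file URL, delimiter property, and target table are assumptions; loading from JDBC, SFTP, or S3 sources uses different source clauses.

```sql
-- Hypothetical sketch: bulk-load a comma-delimited file from HDFS into an
-- existing Db2 Big SQL table, appending to any rows that are already there.
LOAD HADOOP USING FILE URL '/tmp/staging/sales_2023.csv'
    WITH SOURCE PROPERTIES ('field.delimiter' = ',')
    INTO TABLE sales_schema.sales
    APPEND;
```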
**CREATE TABLE...AS**

Uses the INSERT...SELECT statement under the covers to create a table that is similar to another table but with, for example, a different file format.

Advantages:
- Enables data transformations
- Gives you the flexibility to use remote data sources and nicknames (federated data sources)

Restrictions:
- Not recommended for ingesting large quantities of data in bulk
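
A minimal sketch of CREATE TABLE...AS producing an ORC-format copy of an existing text-format table. The table names are hypothetical, and the exact form of the AS clause that is accepted can vary by Db2 Big SQL release.

```sql
-- Hypothetical sketch: create and populate an ORC-format copy of a text table;
-- the new table is filled by an implicit INSERT...SELECT.
CREATE HADOOP TABLE sales_orc
    STORED AS ORC
AS SELECT sale_date, region, amount
   FROM sales_text;
```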
**Sqoop**

The most prominent tool for moving relational data into a Hadoop system.

Advantages:
- Many JDBC connectors are provided
- Faster than INSERT...SELECT
- Some data validation is done to prevent bad data from being added to a table
- You can import compressed tables into Hive

Restrictions:
- Limited to loading data from a JDBC source
- Does not support dynamic partitioning
- Does not transform data
Important: Avoid using any data ingestion technique that results in many small files,
because this can have a negative impact on query performance.