File compaction
As modifications are made to a transactional Hadoop table, new delta file directories are created in the table's storage location. To prevent the number of delta files from increasing indefinitely and potentially impacting query performance, a process called compaction can be used to rewrite and amalgamate the contents of these files.
'NO_AUTO_COMPACTION'='true' when you create or alter the table. When compaction
will not interfere with other statements, you can manually trigger a compaction operation by running
an ALTER TABLE...COMPACT...AND WAIT statement. For
example:CREATE HADOOP TABLE TESTACID.ACIDTBL (
A INT,
B STRING
)
TBLPROPERTIES ('transactional'='true', 'no_auto_compaction'='true')
STORED AS ORC;- Minor compaction rewrites a set of delta directories as a single delta directory.
- Major compaction rewrites a set of delta and base directories as a single base directory.
When run with the AND WAIT option, compaction might not immediately remove obsolete directories. Subsequent compaction runs without the AND WAIT option will remove obsolete directories.
For more information, see Managing Apache Hive: Data compaction.
Custom configuration for specific tables
- no_auto_compaction: Specifies that automatic compaction for the table is to be disabled (default is false).
- compactorthreshold.hive.compactor.delta.num.threshold: Specifies the number of delta directories in a table or partition that will trigger a minor compaction operation. This overrides the system configuration parameter hive.compactor.delta.num.threshold.
- compactorthreshold.hive.compactor.delta.pct.threshold: Specifies the fractional size of delta directories relative to the base directories that will trigger a major compaction operation. This value, in the range from 0 to 1, overrides the system configuration parameter hive.compactor.delta.pct.threshold.
Automatic compaction
- hive.compactor.check.interval: Specifies the time interval, in seconds, between checks to determine whether any tables or partitions need to be compacted (default is 300).
- hive.compactor.delta.num.threshold: Specifies the number of delta directories in a table or partition that will trigger a minor compaction operation (default is 10) .
- hive.compactor.delta.pct.threshold: Specifies the fractional size of delta files relative to the base that will trigger a major compaction operation (default is 0.1). This value must be in the range from 0 to 1. A setting of 0.1 will trigger a major compaction operation when the size of delta files is greater than 10% that of the base files.
- hive.compactor.abortedtxn.threshold: Specifies the number of aborted transactions that will trigger a major compaction operation (default is 1000).
- hive.compactor.history.retention.failed: Specifies the number of failed compaction records per table or partition to retain in the compaction history (default is 3).
- hive.compactor.initiator.failed.compacts.threshold: Specifies the number of failed compaction operations after which no further automatic compaction operations are triggered (default is 2). The value must be less than the value of the hive.compactor.history.retention.failed parameter.
If automatic compaction has stopped because of successive failures, you can run a manual compaction operation by using the ALTER TABLE (HADOOP/HBASE) statement. If the manual compaction operation succeeds, automatic compaction will resume.
Manual compaction
If compaction is required on a table for which automatic compaction is disabled, or if you want to trigger a compaction operation when the system has not yet done so automatically, you can trigger a compaction operation manually by using the ALTER TABLE (HADOOP/HBASE) statement.
Monitor compaction progress and history
Use the SHOW_HIVE_COMPACTIONS stored procedure to monitor the progress of a compaction operation or to review the results of previous compaction operations.
Examples
- Use the ALTER TABLE (HADOOP/HBASE) statement to trigger a major compaction
operation:
ALTER TABLE <table-name> COMPACT MAJOR; - Use the ALTER TABLE (HADOOP/HBASE) statement to trigger a minor compaction operation, and wait
for it to complete:
ALTER TABLE <table-name> COMPACT MINOR AND WAIT; - Disable automatic compaction for a specific
table:
ALTER TABLE <table-name> SET TBLPROPERTIES ('no_auto_compaction'='true'); - Trigger a major compaction operation and modify table properties by using the ALTER TABLE
(HADOOP/HBASE) statement. Re-enable automatic compaction for the table and trigger a compaction
operation if the size of the delta files grows to greater than 50% that of the base data
files:
ALTER TABLE <table-name> COMPACT MAJOR AND WAIT WITH OVERWRITE TBLPROPERTIES ( 'no_auto_compaction'='false', 'compactorthreshold.hive.compactor.delta.pct.threshold'='0.5' ); - Show current and historical
compactions:
CALL SYSHADOOP.SHOW_HIVE_COMPACTIONS;