File compaction

As modifications are made to a transactional Hadoop table, new delta file directories are created in the table's storage location. To prevent the number of delta files from increasing indefinitely and potentially impacting query performance, a process called compaction can be used to rewrite and amalgamate the contents of these files.

When auto compaction is enabled, a transactional table can be compacted by Hive even when other statements are running against that table. If the compactor is making changes while a query is running, errors such as FileNotFoundException might be returned. Therefore, it is recommended that auto compaction be disabled for such a table by adding the table property 'NO_AUTO_COMPACTION'='true' when you create or alter the table. When compaction will not interfere with other statements, you can manually trigger a compaction operation by running an ALTER TABLE...COMPACT...AND WAIT statement. For example:
CREATE HADOOP TABLE TESTACID.ACIDTBL (
  A INT,
  B STRING
)
  TBLPROPERTIES ('transactional'='true', 'no_auto_compaction'='true')
  STORED AS ORC;
Compaction typically takes place as a background process and does not prevent concurrent reads and writes on the data. There are two types of compaction:
  • Minor compaction rewrites a set of delta directories as a single delta directory.
  • Major compaction rewrites a set of delta and base directories as a single base directory.

When run with the AND WAIT option, compaction might not immediately remove obsolete directories. Subsequent compaction runs without the AND WAIT option will remove obsolete directories.

For more information, see Managing Apache Hive: Data compaction.

Custom configuration for specific tables

If compaction is enabled, you can use table properties to customize the triggering rules or to turn off automatic compaction for specific tables:
  • no_auto_compaction: Specifies that automatic compaction for the table is to be disabled (default is false).
  • compactorthreshold.hive.compactor.delta.num.threshold: Specifies the number of delta directories in a table or partition that will trigger a minor compaction operation. This overrides the system configuration parameter hive.compactor.delta.num.threshold.
  • compactorthreshold.hive.compactor.delta.pct.threshold: Specifies the fractional size of delta directories relative to the base directories that will trigger a major compaction operation. This value, in the range from 0 to 1, overrides the system configuration parameter hive.compactor.delta.pct.threshold.

Automatic compaction

If enabled, compaction will trigger automatically as a background process, when necessary without any user intervention. You can adjust the frequency and thresholds for background compactions by setting the following Hive-on-Tez server configuration parameters:

  • hive.compactor.check.interval: Specifies the time interval, in seconds, between checks to determine whether any tables or partitions need to be compacted (default is 300).
  • hive.compactor.delta.num.threshold: Specifies the number of delta directories in a table or partition that will trigger a minor compaction operation (default is 10)
.
  • hive.compactor.delta.pct.threshold: Specifies the fractional size of delta files relative to the base that will trigger a major compaction operation (default is 0.1). This value must be in the range from 0 to 1. A setting of 0.1 will trigger a major compaction operation when the size of delta files is greater than 10% that of the base files.
A major compaction operation can also be triggered if there is a large number of aborted transactions involving a specific table or partition:
  • hive.compactor.abortedtxn.threshold: Specifies the number of aborted transactions that will trigger a major compaction operation (default is 1000).
If automatically triggered compaction operations on a table or partition fail repeatedly, automatic compaction for that table or partition is stopped. The number of successive failures that are required before automatic compaction is stopped is determined by the following configuration parameters:

  • hive.compactor.history.retention.failed: Specifies the number of failed compaction records per table or partition to retain in the compaction history (default is 3).
  • hive.compactor.initiator.failed.compacts.threshold: Specifies the number of failed compaction operations after which no further automatic compaction operations are triggered (default is 2). The value must be less than the value of the hive.compactor.history.retention.failed parameter.

If automatic compaction has stopped because of successive failures, you can run a manual compaction operation by using the ALTER TABLE (HADOOP/HBASE) statement. If the manual compaction operation succeeds, automatic compaction will resume.

Manual compaction

If compaction is required on a table for which automatic compaction is disabled, or if you want to trigger a compaction operation when the system has not yet done so automatically, you can trigger a compaction operation manually by using the ALTER TABLE (HADOOP/HBASE) statement.


Monitor compaction progress and history

Use the SHOW_HIVE_COMPACTIONS stored procedure to monitor the progress of a compaction operation or to review the results of previous compaction operations.

Examples

  • Use the ALTER TABLE (HADOOP/HBASE) statement to trigger a major compaction operation:
    ALTER TABLE <table-name> COMPACT MAJOR;

  • Use the ALTER TABLE (HADOOP/HBASE) statement to trigger a minor compaction operation, and wait for it to complete:
    ALTER TABLE <table-name> COMPACT MINOR AND WAIT;
  • Disable automatic compaction for a specific table:
    ALTER TABLE <table-name> SET TBLPROPERTIES ('no_auto_compaction'='true');
  • Trigger a major compaction operation and modify table properties by using the ALTER TABLE (HADOOP/HBASE) statement. Re-enable automatic compaction for the table and trigger a compaction operation if the size of the delta files grows to greater than 50% that of the base data files:
    ALTER TABLE <table-name> COMPACT MAJOR AND WAIT
      WITH OVERWRITE TBLPROPERTIES (
        'no_auto_compaction'='false',
        'compactorthreshold.hive.compactor.delta.pct.threshold'='0.5'
      );
  • Show current and historical compactions:
    CALL SYSHADOOP.SHOW_HIVE_COMPACTIONS;