Auto-analyze

You can use the auto-analyze feature to automatically run an ANALYZE command on tables when it is determined to be necessary.

When more than 50% of the table data is new, or the table has never been analyzed, the auto-analyze process runs the ANALYZE command. For Db2® Big SQL HBase tables, this also occurs if a major compaction has been done since the last analyze operation. Auto-analyze uses a 10% sampling of the table data. If you have a very large table, you might want to manually run the ANALYZE command, specifying a lower percentage of sampling.

At Db2 Big SQL installation time, auto-analyze is enabled to run every 10 minutes with the default cron schedule. If you need a different schedule, you can modify the default cron schedule by using a DB2® Administrative Task Scheduler (ATS) procedure named SYSPROC.ADMIN_TASK_UPDATE to update the ATS task named BIGSQL_CHECK_ANALYZE. For more information about setting a task cron schedule, see UNIX cron format.

An analyze task is also added to the ATS when auto-analyze detects that an ANALYZE command should be run. These analyze tasks are scheduled to run immediately and only once.

To disable auto-analyze, either modify the cron schedule for the task or remove the ATS task (named BIGSQL_CHECK_ANALYZE) that checks for tables that are to be analyzed.

Impersonation and auto-analyze

If you enable impersonation for Db2 Big SQL, the ANALYZE command runs as the table owner. Otherwise, the ANALYZE command runs as the Db2 Big SQL administrator. The ATS tasks always run as the Db2 Big SQL administrator.

Db2 Big SQL procedures for auto-analyze

Note: SYSHADOOP.BIGSQL_RUN_ANALYZE and SYSHADOOP.BIGSQL_CHECK_ANALYZE are internal procedures that are intended to be used only by auto-analyze.
SYSHADOOP.BIGSQL_RUN_ANALYZE
An analyze task calls this procedure to run the ANALYZE command.
SYSHADOOP.BIGSQL_CHECK_ANALYZE
A scheduled task calls this procedure to determine whether any tables should be analyzed. This procedure has one parameter:
SYSHADOOP.BIGSQL_CHECK_ANALYZE(maxConcurrentTasks)
maxConcurrentTasks
Specifies the maximum number of concurrently running tasks. The value is an integer that is used to override the setting of the biginsights.stats.auto.analyze.concurrent.max property.
SYSHADOOP.BIGSQL_AUTO_ANALYZE_STATUS
A Db2 Big SQL user calls this procedure to check the status of auto-analyze tasks. For details, see BIGSQL_AUTO_ANALYZE_STATUS stored procedure.

Configuration properties for auto-analyze

Use the following properties to set the behavior of auto-analyze.
biginsights.stats.auto.analyze.concurrent.max
Specifies the maximum number of analyze tasks that can be fetched at one time from the scheduler auto-analyze queue. This is equivalent to the number of concurrent ANALYZE commands that can be started by auto-analyze. The number should be kept low to prevent too much resource from being used by auto-analyze, which could impact the performance of other concurrent operations. The default is 1.
biginsights.stats.auto.analyze.newdata.min
Specifies the minimum percentage of new data that must be added to the table before an auto-analyze operation is triggered. The value must be an integer greater than 0. The default is 50, which means that at least 50% of the table data must be new before auto-analyze determines that the table needs to be analyzed.
biginsights.stats.auto.analyze.task.retention.time
Specifies how long completed analyze tasks and their histories are to be kept before they are purged. When histories are purged, the status of completed tasks is no longer available. Valid values are 1WEEK, 1MONTH, FOREVER, or NONE. The default is 1MONTH. When set to NONE, analyze tasks are deleted immediately after completion. When set to FOREVER, completed analyze tasks and their histories are never deleted. You can use the SYSPROC.ADMIN_TASK_REMOVE procedure to periodically purge completed analyze tasks manually.

Auto-analyze after running LOAD HADOOP statements

You can change the behavior of the ANALYZE command after a LOAD HADOOP statement by setting the biginsights.stats.auto.analyze.post.load property. For details, see Configuring automatic ANALYZE after LOAD HADOOP.

Using multiple schedules

There can be multiple tasks with different schedules for calling the SYSHADOOP.BIGSQL_CHECK_ANALYZE procedure, or tasks with different maximum concurrent task settings. These tasks must be added by the Db2 Big SQL administrator. For examples, see Use multiple schedules.

Examples

  • Have auto-analyze run once every day at midnight:
    CALL SYSPROC.ADMIN_TASK_UPDATE(
    'BIGSQL_CHECK_ANALYZE', NULL, NULL, NULL,
    '0 0 * * *', NULL, NULL);
  • Disable the auto-analyze task so that it does not run:
    CALL SYSPROC.ADMIN_TASK_UPDATE(
    'BIGSQL_CHECK_ANALYZE', NULL, NULL, 0, NULL, NULL, NULL);
  • Disable auto-analyze by removing the ATS task that checks the tables (Db2 Big SQL administrator only):
    CALL SYSPROC.ADMIN_TASK_REMOVE(
    'BIGSQL_CHECK_ANALYZE', NULL);
  • Enable auto-analyze by adding an ATS task that checks the tables (Db2 Big SQL administrator only):
    CALL SYSPROC.ADMIN_TASK_ADD(
    'BIGSQL_CHECK_ANALYZE', NULL, NULL, NULL,
    '0,10,20,30,40,50 * * * *', 
    'SYSHADOOP', 'BIGSQL_CHECK_ANALYZE',
    'VALUES(0)', NULL, NULL);
  • Disable auto-analyze by updating the cron schedule of the ATS task (Db2 Big SQL administrator only):
    CALL SYSPROC.ADMIN_TASK_UPDATE(
    'BIGSQL_CHECK_ANALYZE', NULL, NULL, 0, NULL, NULL, NULL);
  • Enable auto-analyze by updating the cron schedule of the ATS task (Db2 Big SQL administrator only). This example sets the cron schedule to check every 10 minutes:
    CALL SYSPROC.ADMIN_TASK_UPDATE(
    'BIGSQL_CHECK_ANALYZE', NULL, NULL, NULL,
    '0,10,20,30,40,50 * * * *', NULL, NULL);
  • Manually purge completed analyze tasks. In this example, the name of the ATS analyze task is Analyze 1458912222 UIUSER.COUNTRY:
    SYSPROC.ADMIN_TASK_REMOVE(
    'Analyze 1458912222 UIUSER.COUNTRY', NULL);
  • Use multiple schedules. The following examples show how to add two SYSHADOOP.BIGSQL_CHECK_ANALYZE tasks with different schedules and different maximum concurrent task settings. The first task (for daytime) runs every 15 minutes from 7:00-18:59 and allows a maximum of one concurrent analyze task. The second task (for nighttime) runs every 30 minutes from 00:00-6:59 and from 19:00-12:59, allowing a maximum of five concurrent analyze tasks.
    Daytime schedule
    CALL SYSPROC.ADMIN_TASK_ADD(
    'BIGSQL_CHECK_ANALYZE days max 1', NULL, NULL, NULL,
    '0,15,30,45 7-18 * * *',
    'SYSHADOOP', 'BIGSQL_CHECK_ANALYZE', 'VALUES(1)', NULL, NULL);
    Nighttime schedule
    CALL  SYSPROC.ADMIN_TASK_ADD(
      'BIGSQL_CHECK_ANALYZE nights max 5',NULL, NULL,  NULL,
          '0,30 0-6,19-23 * * *', 
      'SYSHADOOP','BIGSQL_CHECK_ANALYZE',
      'VALUES(5)',NULL,NULL);