Db2 Big SQL scheduler

The scheduler is a service that coordinates query execution. It communicates with all of the Db2® Big SQL components, including the Hive metastore, the HDFS NameNode, the Db2 Big SQL readers and writers, and the Db2 Big SQL engine.

The scheduler provides access to the metadata for Hadoop tables that are involved in a scan. This metadata includes table descriptors with details such as column types, column lengths, partition information, and input format. The scheduler evenly distributes the scan of a Hadoop table across your cluster and also performs partition elimination. It uses system resources efficiently to maximize throughput and minimize response time.

The scheduler assigns splits to readers and commits data insertion. The Db2 Big SQL readers and writers use the information from the scheduler to do the actual reading of HDFS data.

Check the status of the scheduler by running the following command:
$BIGSQL_HOME/bin/bigsql status -scheduler

Scheduler configuration

You can view the Db2 Big SQL scheduler configuration at $BIGSQL_HOME/conf/bigsql-conf.xml. If you change values in the scheduler configuration file, you must stop and restart the scheduler so that the changes can take effect. To do so, use the following commands:

$BIGSQL_HOME/bin/bigsql stop -scheduler
$BIGSQL_HOME/bin/bigsql start -scheduler
Db2 Big SQL cannot process queries while the scheduler is restarting.
The following core properties must be defined in the scheduler configuration file:
  • The name of the host on which the scheduler runs. This host is typically the master node. Unless you have a single-node environment, the name cannot be localhost.
  • The port on which the scheduler receives administrator commands such as stop or start. The default value is 7054. You can specify any valid port number.
  • The port on which the scheduler receives requests for scheduling work. The default value is 7053. You can specify any valid port number.
  • The Java options for the scheduler JVM. You can specify any JVM option. For example, the following value sets the initial heap size to 512 MB and the maximum heap size to 2 GB:
    <value>-Xms512M -Xmx2G</value>
Every logical Db2 Big SQL worker connects to the Db2 Big SQL scheduler at the start of a scan to get the list of HDFS splits that have to be read for the table involved in the query. The total number of threads used by the scheduler, configured through the scheduler.maxWorkerThreads property in bigsql-conf.xml, is 1024 by default. Depending on the complexity of the query, the number of tables and subqueries, and the number of table scans performed, the number of requests from the Db2 Big SQL workers to the scheduler can add up. On clusters that are configured with many logical worker nodes and that run complex queries in parallel, the value of the scheduler.maxWorkerThreads property might need to be increased beyond the default.
The number of threads in use can be monitored by periodically running a command such as the following on the host on which the Db2 Big SQL head node resides:
ps -eLf | grep -c bigsql.scheduler.server

If in doubt, there is no harm in increasing the value of the scheduler.maxWorkerThreads property by a factor of two or more; when no SQL workload is running, the scheduler automatically tapers the thread pool down to a minimum of eight threads, as defined by the scheduler.minWorkerThreads property. Doubling the value of the scheduler.maxWorkerThreads property is a good starting point if you determine that the default value is likely to be exceeded.
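As an illustration, assuming that bigsql-conf.xml uses the standard Hadoop-style property format (name/value elements), doubling the default thread pool might look like the following sketch:

```xml
<!-- Illustrative only: double the default of 1024 scheduler worker threads -->
<property>
  <name>scheduler.maxWorkerThreads</name>
  <value>2048</value>
</property>
```

As with any change to the scheduler configuration file, stop and restart the scheduler so that the new value takes effect.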

In addition, the scheduler has the following three timeout configuration properties that you can tune:

  • A server-side timeout that controls how long the scheduler server waits for a client to connect and send its communication before timing out. The default value is 60 minutes. Typically, this value does not need to be modified.
  • scheduler.client.request.timeout: Throughout normal workload execution, the Db2 engine components in Db2 Big SQL make requests to the scheduler and then wait for a response. Because the scheduler runs as a separate service from the Db2 engine, this timeout enables the Db2 engine to exit early and fail if the scheduler takes too long to respond. The DDL processor component also uses this property when it sends metadata refresh requests to the scheduler. The default value is 120 seconds.
  • scheduler.client.request.IUDEnd.timeout: Identical in concept to scheduler.client.request.timeout, except that it applies only to insert/update/delete (IUD) queries. Because of the nature of IUD workloads, the scheduler might take somewhat longer for these types of operations, so this property provides more granular timeout control based on the type of query: you can give IUD queries more time without affecting the existing timeout for non-IUD queries. The default value is 600 seconds.

The scheduler.client.request.timeout and scheduler.client.request.IUDEnd.timeout properties might need to be adjusted when many files or many partitions are involved, because loading the scheduler cache with the table metadata, including the metadata for individual partitions and files, can take an appreciable amount of time. In this case, review the diagnostics in the /var/ibm/bigsql/logs/bigsql-sched-recurring-diag-info.log file on the Db2 Big SQL head node host to help determine an appropriate value for these properties. The bigsql-sched-recurring-diag-info.log file records useful diagnostic information, including the time spent in various scheduler functions, presented both as a "top (5) max elapsed times:" list and as an "elapsed-time-range-in-millis and frequency-of-calls-in-that-range" histogram. The elapsed times are in milliseconds.

Two functions in particular are related to scheduler client timeouts: requestScanMetadata and registerQueryNew. The following example shows the elapsed timings recorded in the bigsql-sched-recurring-diag-info.log file for the requestScanMetadata function:

top (5) max elapsed times:
time= 603484; info= [TableSchema(schName:<schema>, tblName:<tablename>, impersonationID:bigsql)]
time= 590778; info= [TableSchema(schName:<schema>, tblName:<tablename>, impersonationID:bigsql)]
time= 448659; info= [TableSchema(schName:<schema>, tblName:<tablename>, impersonationID:bigsql)]
time= 437044; info= [TableSchema(schName:<schema>, tblName:<tablename>, impersonationID:bigsql)]
time= 413360; info= [TableSchema(schName:<schema>, tblName:<tablename>, impersonationID:bigsql)]
elapsed-time-range-in-millis and frequency-of-calls-in-that-range:
range= 0-10; freq= 10956
range= 10-100; freq= 5553
range= 100-1000; freq= 165
range= 1000-10000; freq= 48
range= 10000-100000; freq= 21
range= 100000-1000000; freq= 13
range= 1000000-9223372036854775807; freq= 0

Choose values for the scheduler.client.request.IUDEnd.timeout and scheduler.client.request.timeout properties that are greater than the longest elapsed times recorded under the registerQueryNew and requestScanMetadata functions. In the example above, the largest recorded elapsed time is 603484 milliseconds (about 10 minutes). Therefore, a reasonable value for both scheduler.client.request.timeout and scheduler.client.request.IUDEnd.timeout is 720000 milliseconds (12 minutes).
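To pick a safe value, you need the largest elapsed time in the log. As a rough sketch, assuming the `time= <millis>` entry format shown above, a one-liner like the following could report the maximum (the here-doc stands in for the real /var/ibm/bigsql/logs/bigsql-sched-recurring-diag-info.log file):

```shell
# Sketch: report the largest "time= <millis>" value in scheduler diagnostics.
# In practice, point grep at the real log file instead of the here-doc sample.
grep -o 'time= [0-9]*' <<'EOF' | awk '{ if ($2 + 0 > max) max = $2 } END { print max }'
time= 603484; info= [TableSchema(schName:S1, tblName:T1, impersonationID:bigsql)]
time= 413360; info= [TableSchema(schName:S1, tblName:T2, impersonationID:bigsql)]
EOF
```

For the sample lines above, the command prints 603484, the largest elapsed time in milliseconds.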

  • After changing the scheduler timeout properties, it is important to restart the entire Db2 Big SQL service, and not just the Db2 Big SQL scheduler component, because these are client-level properties that must be propagated to all the worker nodes.
  • It is also important to review how many files and partitions are present in any Hadoop table that exhibits longer than average elapsed times for the registerQueryNew and requestScanMetadata operations in the bigsql-sched-recurring-diag-info.log file. The Hadoop community generally discourages having many small files in a table. As a general rule of thumb, no individual table should have more than a few hundred thousand files or a few thousand partitions.

Reading files in HDFS subdirectories

The scheduler.read.subdirectories parameter controls whether data in subdirectories is read by Db2 Big SQL. The default is true. For example, consider an external table that is located in /tmp/testtable/:

[root@bdsup006 ~]# hadoop fs -ls -R /tmp/testtable/
drwxr-xr-x - bigsql hdfs 0 2015-12-09 22:29 /tmp/testtable/subtab
-rw-r--r-- 3 bigsql hdfs 6 2015-12-09 22:29 /tmp/testtable/subtab/text2.txt
-rw-r--r-- 3 bigsql hdfs 6 2015-12-09 22:28 /tmp/testtable/text1.txt
In this example, with the scheduler.read.subdirectories parameter set to true, both text1.txt and text2.txt are read.

If you do not want subdirectories to be read, set this option to false, and restart the Db2 Big SQL service. In this case, only text1.txt is read.
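Assuming the Hadoop-style property format used in bigsql-conf.xml, the setting might look like the following sketch:

```xml
<!-- Illustrative only: stop Db2 Big SQL from reading data in subdirectories -->
<property>
  <name>scheduler.read.subdirectories</name>
  <value>false</value>
</property>
```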

Scheduler cache configuration

To improve performance, the Db2 Big SQL scheduler caches the responses that are generated by the Hive metastore and the HDFS NameNode. By default, the scheduler caches this metadata for 20 minutes. You can change this value in one of the following ways:
  • If you require longer or shorter cache times, change the value of the scheduler.tableMetaDataCache.timeToLive parameter to an integer that represents the required cache time, in milliseconds.
  • If you require synchrony with the Hive metastore and the HDFS NameNode metadata, disable the scheduler cache by setting the scheduler.tableMetaDataCache.timeToLive parameter to 0.
  • If you require some tables to leverage the scheduler cache and other tables to maintain synchrony with the Hive metastore and the HDFS NameNode metadata, you can disable the scheduler cache for the latter.
  • You can also manipulate cache times by changing the scheduler.tableMetaDataCache.expirePolicy parameter. The default value is expireAfterAccess, which resets the timer every time a cache item is accessed before it has expired. The alternative is expireAfterWrite, which means that the scheduler.tableMetaDataCache.timeToLive timer starts counting down when cache items are first written into the cache. The expireAfterAccess value is a method for extending the lifespan of metadata in the cache for more frequently accessed tables.
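For example, assuming the Hadoop-style property format of bigsql-conf.xml, a 60-minute cache whose timer starts at write time might be configured as follows (the values are illustrative):

```xml
<!-- Illustrative only: cache metadata for 60 minutes (3600000 ms), counted
     from when an item is written rather than from its last access -->
<property>
  <name>scheduler.tableMetaDataCache.timeToLive</name>
  <value>3600000</value>
</property>
<property>
  <name>scheduler.tableMetaDataCache.expirePolicy</name>
  <value>expireAfterWrite</value>
</property>
```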

Disabling the scheduler cache for a set of tables

You can selectively disable the scheduler cache in one of the following two ways:
  • Disable the scheduler cache based on a regular expression match. The scheduler.cache.exclusion.regexps property in the Db2 Big SQL configuration file accepts a comma-separated list of regular expressions. This list must have an even number of elements: each pair of regular expressions in the list is matched against table schemas and names, and all matching tables are excluded from the scheduler cache. For example, to match all table names that have a prefix of uncached, in any schema, you could use the regular expression pair .*,uncached.* (schema pattern .*, table name pattern uncached.*).
    Be sure to restart the scheduler after any configuration change.
  • Disable the scheduler cache based on a table property. You can use the ALTER TABLE statement to disable the scheduler cache for a particular table. For example:
    ALTER TABLE schema1.table1
      SET TBLPROPERTIES('scheduler.cache.exclusion'='true');
    This operation takes effect immediately.
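To apply the regular-expression approach described above, the exclusion list could be set in bigsql-conf.xml, assuming its Hadoop-style property format (a sketch only):

```xml
<!-- Illustrative only: exclude from the scheduler cache every table, in any
     schema, whose name begins with "uncached" -->
<property>
  <name>scheduler.cache.exclusion.regexps</name>
  <value>.*,uncached.*</value>
</property>
```

Unlike the table-property approach, this change requires a scheduler restart to take effect.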

Error logs

When the scheduler receives or generates an exception, an error (SQL5105N) is produced with the name of the component that issued the exception. The error includes a log entry identifier, a three-part name that enables you to identify which file to refer to for a particular error message. The identifier consists of three parts: SCL-NNN-XXXYYYZZZ. You can use the SYSHADOOP.LOG_ENTRY table function to get log file information for a particular log entry identifier. Alternatively, you can manually search the $BIGSQL_DIST_VAR/logs/bigsql-sched.log file.
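A lookup along the following lines could retrieve the log information for an identifier. This is a sketch only: it assumes that SYSHADOOP.LOG_ENTRY takes the log entry identifier as its single argument; check the Db2 Big SQL SQL reference for the exact signature.

```sql
-- Hypothetical invocation: look up the log details for one identifier
-- (replace SCL-NNN-XXXYYYZZZ with an actual identifier from the error)
SELECT *
  FROM TABLE(SYSHADOOP.LOG_ENTRY('SCL-NNN-XXXYYYZZZ'));
```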

If the reader or writer invokes the Db2 Big SQL scheduler and receives an exception, the log entry identifier might be specific to the Db2 Big SQL readers and writers. Therefore, you might need to search the logs that are specific to the readers and writers.

For details about log entry identifiers and the locations of the log files, see Db2 Big SQL log management on CDP.