Automatic Hive catalog syncing to the Db2 Big SQL catalog

Tables that are created, altered, or dropped by Hive clients can have the associated catalog changes automatically applied to Db2® Big SQL.

Enabling automatic syncing

Automatic syncing of the Db2 Big SQL catalog and the Hive metastore is enabled by default. If you have disabled automatic syncing (see Disabling automatic syncing), you can enable this feature again in Ambari (IBM Db2 Big SQL > Configs > Advanced bigsql-env > Automatic metadata sync enabled). Restart the Db2 Big SQL service. There might be a delay of a few minutes before automatic syncing resumes.

When you enable this feature, you also create a scheduled event for the following synchronization call:
db2 "CALL SYSPROC.ADMIN_TASK_ADD ('Synchronise MetaData Changes from Hive',
  NULL, NULL, NULL, NULL, 'SYSHADOOP', 'HCAT_AUTOSYNC_OBJECTS', NULL, NULL, NULL)
You can invoke the HCAT_SYNC_OBJECTS stored procedure to add all current Hive tables from all schemas to the Db2 Big SQL catalog:
db2 "CALL SYSHADOOP.HCAT_SYNC_OBJECTS('.*', '.*', 'a', 'REPLACE', 'CONTINUE', 'TRANSFER OWNERSHIP TO HIVEOWNER')"
Important:
  • Non-transactional bucketed tables are not supported in Db2 Big SQL and therefore are not synchronized.
  • Tables that are created under the Hive default schema are not automatically synced; you must synchronize these tables manually if you want them in Db2 Big SQL.

Running Metadata Sync

You can invoke the HCAT_SYNC_OBJECTS stored procedure for all current schemas and tables by selecting the Run Metadata Sync service actions menu item. With impersonation enabled, this action simply reports that impersonation is turned on. Users are responsible for synchronizing their own tables and schemas. The bigsql user does not necessarily have permission to query a table in the HDFS, and invoking this action as bigsql for all user schemas and tables might result in permissions errors.

Monitoring the synchronization

The bigsql.log file receives log information directly from the HCAT_AUTOSYNC_OBJECTS process.

Automatic syncing uses a scheduled DB2® administrative task scheduler (ATS) sync procedure. The description of the sync procedure is Synchronise MetaData Changes from Hive. You can monitor this procedure like any other ATS scheduled procedures. For information about ATS and how to monitor such procedures, see Administrative task scheduler.

By default, event files are placed in the HDFS under /user/bigsql/sync. The file pattern for those files is fixed, and only files that match this pattern are processed. The general file pattern is DDL-timestamp-UUID.json. For example: DDL-[0-9]{17}-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}.json. If files are left in this directory for some time, it is good practice to investigate. If you disable automatic syncing, files collect in this directory but are not processed. You must determine whether or not such files should be retained. Alternatively, in Ambari you can navigate to Hive > Configs > Advanced > Custom hive-site, manually remove the hive.metastore.event.listeners entry, and then restart Hive. In this case, files do not accumulate in the sync directory.

Disabling automatic syncing

Disabling automatic syncing removes the scheduled process. Events are still gathered and placed in the sync directory, but are never processed. The following example shows how to disable automatic syncing:
db2 "CALL SYSPROC.ADMIN_TASK_REMOVE ('Synchronise MetaData Changes from Hive',NULL)"

Hive listener

The automatic Hive catalog synchronizing (HCAT_AUTOSYNC_OBJECTS) uses a Hive listener to get notified when changes have occurred in the Hive metastore. During Db2 Big SQL install, the default listener is set in the Hive configuration as shown:
hive.metastore.event.listeners=com.ibm.biginsights.bigsql.sync.BIEventListener
As well, the bigsql-sync.jar file is copied to the hive server node.

The BIEventListener will create an HDFS event file in the /user/bigsql/sync directory for each Hive DDL event. For each event, the Big SQL catalog is synchronized and the event file is deleted. If an error occurs while synchronizing an event then the event file is moved to /user/bigsql/sync/errors.

Another Hive listener, DbNotificationListener, can be used instead of BIEventListener. DbNotificationListener will add events to the Hive metastore NOTIFICATION_LOG table. The events are read from the metastore by the automatic Hive catalog synchronizing (HCAT_AUTOSYNC_OBJECTS).

DbNotificationListener is built into Hive and does not require the bigsql-sync.jar file to be installed on the Hive server. DbNotificationListener will handle both DDL and DML events and it can be used with Hortonworks Data Platform (HDP) 2.6.3 and above. When DML events occur the Big SQL scheduler cache is refreshed.

To use DbNotificationListener, make the following modifications to the Hive configuration through Ambari:
  1. Remove the com.ibm.biginsights.bigsql.sync.BIEventListener class from the hive.metastore.event.listeners property. Check that hive.metastore.transactional.event.listeners is set to org.apache.hive.hcatalog.listener.DbNotificationListener. Either hive.metastore.event.listeners or hive.metastore.transactional.event.listeners (but not both properties) should be set to org.apache.hive.hcatalog.listener.DbNotificationListener.
  2. Set hive.metastore.dml.events=true.
  3. Set metastore.metastore.event.db.notification.api.auth=false to allow the Db2 Big SQL service to read events.
  4. It is recommended to increase the length of time that events remain in the metastore before being deleted. The default setting is one day. Set hive.metastore.event.db.listener.timetolive=604800 to have events remain in the metastore for one week.
  5. Restart the Hive and Db2 Big SQL services.
The following table shows a comparison of the two Hive listeners.
  BIEventListener DbNotificationListener
HDP versions supported 2.5.x and above 2.6.3 and above
DDL events:
CREATE/ALTER/DROP TABLE
YES YES
DML events:
ADD/DROP/ALTER PARTITON
INSERT
No YES
Event storage location HDFS Hive metastore

Synchronization errors

In the event of a synchronization failure, the associated event-file is moved from the events-directory (by default: /user/bigsql/sync) to an "errors" sub-directory (by default: /user/bigsql/sync/errors). It is good practice to monitor this directory for any synchronization errors and address any issues where necessary.

When any issues have been addressed, and where synchronization for the associated object(s) is still required, the relevant event-files can then be moved back into the events-directory (by default: /user/bigsql/sync) where automatic synchronization will take place.

For more information, see Automatic syncing of the Db2 Big SQL and Hive catalogs.