Additional requirements for Databricks

This topic describes the additional requirements for configuring CDC FlexRep to target Databricks so that transactional integrity and optimal performance are maintained.

The fast apply algorithms that preserve source transaction consistency and consolidate multiple operations for a row are the NetEffect algorithms. For instance, if a row is inserted and then deleted in the same unit of work (UOW), CDC discards both the insert and the delete. CDC also reorders operations within a UOW to optimize execution. Specifically:

  • Deletes for each table are grouped together and applied consecutively.
  • Inserts for each table are grouped together and applied consecutively.

This ordering enables deletes and inserts to be processed as single batches per table, rather than as individual operations. For optimal performance in Databricks, apply operations in batches. Executing single‑row operations is highly inefficient and results in significantly slower performance.
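For example, a UOW whose net effect on a table is three inserts and two deletes can be applied with a single multi-row statement of each kind. The following is a simplified sketch only; the table and column names are illustrative, and the exact SQL that CDC generates depends on the properties described later in this topic:

  -- All net deletes for the table, applied as one multi-row statement keyed on the replication key
  DELETE FROM sales.orders WHERE order_id IN (2001, 2002);
  -- All net inserts for the table, applied as one multi-row statement
  INSERT INTO sales.orders (order_id, amount) VALUES (1001, 10.50), (1002, 27.00), (1003, 5.25);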

Selecting a Fast Apply Algorithm

To decide which fast apply algorithm to use for your Databricks target, consider these characteristics:
  • GroupByTableNetEffectConvertUpdates uses a single thread. Select this if parallelism is not required.
  • ParallelizeByHashNetEffectConvertUpdates is a parallel apply algorithm in which the tables are split amongst the apply threads through a hashing algorithm. This ensures that all apply threads are given a uniform distribution of the workload, even if some tables are busy and others are relatively idle.
  • ParallelizeByTableNetEffectConvertUpdates is a parallel apply algorithm in which the tables are split amongst the apply threads through a round-robin algorithm. Select this if your tables have roughly the same activity level.

Configuring Fast Apply Algorithm Properties

Perform these steps to configure the fast apply algorithm properties:
  1. Set the UOW threshold to a value greater than or equal to <number of tables in a typical UOW> * global_max_batch_size * 2, as illustrated in the example after these steps.
  2. Set the number of apply threads initially to 4 (for parallel fast apply algorithms) and increase as needed to achieve the desired throughput.
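For example, if a typical UOW touches 5 tables and global_max_batch_size is set to 2000, set the UOW threshold to at least 5 * 2000 * 2 = 20000.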

Special Considerations for Databricks

Consider the following aspects for effective use and management of Databricks:
  • To avoid exceeding Databricks' 255 SQL parameter limit, set the CDC system property target_apply_single_multirow_statement_column_values_inline to true to append insert or delete column values as literal strings instead of SQL parameters.
  • To improve the mirroring performance, increase the CDC system property global_max_batch_size to a value much larger than the default (50), typically 2000 or more. However, be cautious of very long SQL statements that may degrade performance. Increase the batch size until performance starts to decrease for the typical workload. Ensure that the fast apply threshold is set substantially higher than the value defined for this property, because a batch cannot extend across UOW boundaries.
  • To improve the table refresh (load) performance, increase the CDC system properties jdbc_refresh_max_batch_size and jdbc_refresh_commit_after_max_operations to larger values (for instance, 5000 or more).
    Note: jdbc_refresh_max_batch_size cannot exceed the value set for jdbc_refresh_commit_after_max_operations.
  • Enable parallel refresh by setting target_allow_parallel_jdbc_refresh to true and specifying the desired level of parallelism (target_num_refresh_apply_threads, default is 4).
  • To apply multiple inserts and deletes in a single batch for optimal performance, set target_apply_insert_batch_via_single_statement and target_apply_delete_batch_via_single_statement to true.
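To illustrate the effect of target_apply_single_multirow_statement_column_values_inline, consider a simplified sketch (the table, columns, and values are illustrative only). With the property set to false, each column value in a multi-row statement is sent as a SQL parameter, so a batch of 2000 rows on a two-column table would require 4000 parameters and exceed the 255-parameter limit:

  INSERT INTO sales.orders (order_id, amount) VALUES (?, ?), (?, ?), (?, ?);

With the property set to true, the column values are appended as literal strings, so the statement carries no SQL parameters and the batch size is limited only by statement length:

  INSERT INTO sales.orders (order_id, amount) VALUES (1001, 10.50), (1002, 27.00), (1003, 5.25);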

Summary of Required Properties for Databricks

Property name | Default value | Recommended value
target_apply_insert_batch_via_single_statement | false | true
target_apply_delete_batch_via_single_statement | false | true
target_apply_single_multirow_statement_column_values_inline | false | true
global_max_batch_size | 50 | 2000 or more
jdbc_refresh_max_batch_size | 500 | 5000 or more
jdbc_refresh_commit_after_max_operations | 1000 | 5000 or more
target_allow_parallel_jdbc_refresh | false | true
target_num_refresh_apply_threads | 4 | 8 or more

Defining Replication Keys

CDC requires a set of target table columns that uniquely identify each row. This set of columns is referred to as the replication key. To ensure optimal performance and reliability, consider the following guidelines:
  • Use primary keys as replication keys. The most effective replication key columns are the table's primary key columns. When a primary key constraint is declared, CDC automatically selects those columns as the replication key.
  • Declare primary key constraints in Databricks. Primary key constraints are supported in Databricks workspaces that support Unity Catalog (with a default catalog other than Hive). It is strongly recommended to declare primary key constraints on all target tables.
  • Define replication keys when no primary key exists. If a primary key constraint is not declared, explicitly define replication key columns that logically form a unique key, and ensure that these columns are declared as NOT NULL to maintain data integrity.
  • Prefer single-column primary keys. Multi-column primary keys can significantly slow down multi-row delete operations due to Databricks SQL optimizer limitations. Whenever possible, declare a single-column primary key to improve performance, as shown in the sketch after this list.
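As a minimal sketch (the catalog, schema, table, and column names are assumptions), a target table in a Unity Catalog workspace can be declared with a single-column primary key constraint as follows:

  CREATE TABLE main.sales.orders (
    order_id BIGINT NOT NULL,
    amount   DECIMAL(10,2),
    CONSTRAINT orders_pk PRIMARY KEY (order_id)
  );

Databricks primary key constraints are informational rather than enforced, but declaring one allows CDC to select the key column as the replication key automatically.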

JDBC Driver issues

The Databricks JDBC driver requires a specific JVM option. Without it, you may encounter connection failure errors when starting the dmts64 service and when saving instance settings in dmconfigurets. To prevent these errors, the following JVM option must be present in the dmts64.vmargs file in <CDC Replication instance directory>/conf: --add-opens java.base/java.nio=ALL-UNNAMED

If the option is not present, add it at the beginning of the file, followed by a space to separate it from the existing options. Once added, the dmts64.vmargs file should look like this:
--add-opens java.base/java.nio=ALL-UNNAMED -Dorg.apache.commons.logging.Log=com.datamirror.ts.util.trace.TraceSimpleLog -Dcom.datamirror.ts.instance=%TSINSTANCE% com.datamirror.ts.bootstrap.Launcher com.datamirror.ts.commandlinetools.script.Startup

Schemas and Catalogs

When creating a CDC Replication Engine for FlexRep, you must specify a schema where CDC can generate its metadata tables. This schema must reside in the default catalog for the CDC user, and the user must have authority to create and drop tables within it. Replicated tables can be located in this schema or in any other schema within the user’s default catalog. However, only tables in the default catalog for the CDC user are eligible for replication. If replication is required from a different catalog, you must create a separate CDC instance with a different user, and that catalog must be set as the default for that user.
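As an illustrative sketch (the catalog, schema, and principal names are assumptions), the metadata schema can be created in the CDC user's default catalog and the required privileges granted with statements such as the following:

  CREATE SCHEMA IF NOT EXISTS main.cdc_metadata;
  GRANT USE CATALOG ON CATALOG main TO `cdc_user@example.com`;
  GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.cdc_metadata TO `cdc_user@example.com`;

Tables that the CDC user creates in this schema are owned by that user, which also covers the authority to drop them.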