Additional requirements for Databricks
This topic describes the additional requirements for configuring CDC FlexRep to target Databricks in order to ensure transactional integrity and performance.
The fast apply algorithms whose names include NetEffect preserve source transaction consistency and consolidate multiple operations for a row into their net effect. For instance, if a row is inserted and then deleted in the same Unit of Work (UOW), CDC discards both the insert and the delete. CDC also reorders operations within a UOW to optimize execution. Specifically:
- Deletes for each table are grouped together and applied consecutively.
- Inserts for each table are grouped together and applied consecutively.
This ordering allows deletes and inserts to be applied as single batches per table rather than as individual operations. Batching is essential for optimal performance in Databricks: executing single-row operations is highly inefficient and significantly slows replication.
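As a rough illustration of net-effect consolidation and per-table grouping, the following sketch collapses operations on the same row within a UOW and then emits one delete batch and one insert batch per table. The types and logic are hypothetical and greatly simplified; they are not CDC internals.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of net-effect consolidation within one UOW; not CDC internals.
public class NetEffectSketch {

    enum Op { INSERT, DELETE }

    record RowChange(String table, String key, Op op) { }

    // Collapse multiple operations on the same row to their net effect,
    // grouped by table so each table can be applied as one batch per operation type.
    static Map<String, Map<String, Op>> consolidate(List<RowChange> uow) {
        Map<String, Map<String, Op>> perTable = new LinkedHashMap<>();
        for (RowChange c : uow) {
            Map<String, Op> rows = perTable.computeIfAbsent(c.table(), t -> new LinkedHashMap<>());
            if (rows.get(c.key()) == Op.INSERT && c.op() == Op.DELETE) {
                rows.remove(c.key());          // insert then delete in the same UOW: drop both
            } else {
                rows.put(c.key(), c.op());     // otherwise keep the latest operation for the row
            }
        }
        return perTable;
    }

    public static void main(String[] args) {
        List<RowChange> uow = List.of(
            new RowChange("ORDERS", "1", Op.INSERT),
            new RowChange("ORDERS", "1", Op.DELETE),   // cancels the insert above
            new RowChange("ORDERS", "2", Op.INSERT),
            new RowChange("ITEMS",  "9", Op.DELETE));

        // All deletes for a table form one batch, then all inserts form another batch.
        consolidate(uow).forEach((table, rows) -> {
            System.out.println("DELETE batch for " + table + ": "
                + rows.entrySet().stream().filter(e -> e.getValue() == Op.DELETE)
                      .map(Map.Entry::getKey).toList());
            System.out.println("INSERT batch for " + table + ": "
                + rows.entrySet().stream().filter(e -> e.getValue() == Op.INSERT)
                      .map(Map.Entry::getKey).toList());
        });
    }
}
```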
Selecting a Fast Apply Algorithm
- GroupByTableNetEffectConvertUpdates uses a single thread. Select this if parallelism is not required.
- ParallelizeByHashNetEffectConvertUpdates is a parallel apply algorithm in which the tables are split among the apply threads by a hashing algorithm. This gives all apply threads a uniform share of the workload, even if some tables are busy and others are relatively idle.
- ParallelizeByTableNetEffectConvertUpdates is a parallel apply algorithm in which the tables are split among the apply threads in round-robin fashion. Select this if your tables have roughly the same activity level.
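A minimal sketch of the difference between the two distribution strategies follows. It is purely conceptual; CDC's actual thread-assignment logic is internal to the engine, and the table name and index are placeholders.

```java
// Conceptual sketch only; not CDC's real thread-assignment code.
public class ApplyThreadAssignmentSketch {
    public static void main(String[] args) {
        int numApplyThreads = 4;

        // Hash-based split (ParallelizeByHashNetEffectConvertUpdates idea): a table name
        // always hashes to the same thread, and the hash spreads tables evenly across threads.
        int hashThread = Math.floorMod("SALES.ORDERS".hashCode(), numApplyThreads);

        // Round-robin split (ParallelizeByTableNetEffectConvertUpdates idea): the Nth table
        // in the subscription goes to thread N modulo the number of threads.
        int tableIndex = 7;   // hypothetical position of the table in the table list
        int roundRobinThread = tableIndex % numApplyThreads;

        System.out.println("hash -> thread " + hashThread
            + ", round-robin -> thread " + roundRobinThread);
    }
}
```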
Configuring Fast Apply Algorithm Properties
- Set the UOW threshold to a value greater than or equal to <number of tables in a typical UOW> * global_max_batch_size * 2.
- Set the number of apply threads initially to 4 (for parallel fast apply algorithms) and increase as needed to achieve the desired throughput.
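As a hedged, worked example of the threshold formula (the workload numbers below are assumptions, not recommendations): with 10 tables in a typical UOW and global_max_batch_size set to 2000, the UOW threshold should be at least 40,000.

```java
// Illustrative sizing only; the inputs describe a hypothetical workload.
public class UowThresholdSizing {
    public static void main(String[] args) {
        int tablesPerUow = 10;             // typical number of tables touched by one UOW
        int globalMaxBatchSize = 2000;     // matches the recommended value in this topic
        System.out.println(tablesPerUow * globalMaxBatchSize * 2);   // prints 40000
    }
}
```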
Special Considerations for Databricks
- To avoid exceeding Databricks' 255 SQL parameter limit, set the CDC system property target_apply_single_multirow_statement_column_values_inline to true so that insert and delete column values are appended as literal strings instead of SQL parameters (see the sketch after this list).
- To improve mirroring performance, increase the CDC system property global_max_batch_size to a value much larger than the default (50), typically 2000 or more, but be cautious: very long SQL statements can themselves degrade performance. Increase the batch size until performance starts to decrease for the typical workload. Ensure that the fast apply UOW threshold is set substantially higher than the value defined for this property, because a batch cannot span UOW boundaries.
- To improve table refresh (load) performance, increase the CDC system properties jdbc_refresh_max_batch_size and jdbc_refresh_commit_after_max_operations to larger values (for instance, 5000 or more). Note: jdbc_refresh_max_batch_size cannot exceed the value set for jdbc_refresh_commit_after_max_operations.
- Enable parallel refresh by setting target_allow_parallel_jdbc_refresh to true and specifying the desired level of parallelism (target_num_refresh_apply_threads, default is 4).
- To apply multiple inserts and deletes in a single batch for optimal performance, set target_apply_insert_batch_via_single_statement and target_apply_delete_batch_via_single_statement to true.
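To make the effect of these settings concrete, the following sketch contrasts a parameterized multi-row INSERT with an inline-literal one. The table and values are hypothetical and the SQL is generic; it is not the exact statement text that CDC generates.

```java
// Hypothetical illustration; not the SQL that CDC itself generates.
public class MultiRowInsertSketch {
    public static void main(String[] args) {
        int columns = 3;
        int rowsPerStatement = 100;

        // Parameterized multi-row INSERT: one '?' marker per column per row.
        // 3 columns x 100 rows = 300 markers, which exceeds the 255-parameter limit.
        int parameterMarkers = columns * rowsPerStatement;
        System.out.println("parameter markers needed: " + parameterMarkers);

        // Inline-literal form (target_apply_single_multirow_statement_column_values_inline=true):
        // values are embedded as literals, so the statement needs no parameter markers, and
        // target_apply_insert_batch_via_single_statement=true applies the whole batch at once.
        StringBuilder sql = new StringBuilder("INSERT INTO sales.orders VALUES ");
        for (int r = 1; r <= rowsPerStatement; r++) {
            sql.append(r == 1 ? "" : ", ")
               .append("(").append(r).append(", 'literal value', 9.99)");
        }
        System.out.println(sql.length() + " characters, 0 parameter markers");
    }
}
```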
Summary of Required Properties for Databricks
| Property Name | Default Value | Recommended Value |
|---|---|---|
| target_apply_insert_batch_via_single_statement | False | True |
| target_apply_delete_batch_via_single_statement | False | True |
| target_apply_single_multirow_statement_column_values_inline | False | True |
| global_max_batch_size | 50 | 2000 or more |
| jdbc_refresh_max_batch_size | 500 | 5000 or more |
| jdbc_refresh_commit_after_max_operations | 1000 | 5000 or more |
| target_allow_parallel_jdbc_refresh | False | True |
| target_num_refresh_apply_threads | 4 | 8 or more |
Defining Replication Keys
- Use primary keys as replication keys: the most effective replication key columns are the table's primary key columns.
- Declare primary key constraints in Databricks: in Databricks workspaces that support Unity Catalog (with a default catalog other than Hive), primary key constraints are supported (see the sketch after this list).
- Define replication keys when no primary key exists: if a primary key constraint is not declared, explicitly define replication key columns that logically form a unique key.
- Prefer single-column primary keys: multi-column primary keys can significantly slow down multi-row delete operations because of Databricks SQL optimizer limitations.
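Where a table does not yet declare its key, a primary key constraint can be added with Databricks SQL, for example over JDBC. The sketch below is illustrative only: the connection URL, catalog, schema, table, and column names are placeholders, and the key column must be made NOT NULL before the constraint can be declared.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Illustrative sketch: declaring a primary key constraint on a Unity Catalog table
// so that it can serve as the replication key. All names and the URL are placeholders.
public class DeclarePrimaryKeySketch {
    public static void main(String[] args) throws Exception {
        String url = "<databricks-jdbc-url>";   // see the Databricks JDBC driver documentation
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Primary key columns must be NOT NULL before the constraint can be added.
            stmt.execute("ALTER TABLE main.sales.orders ALTER COLUMN order_id SET NOT NULL");
            // Single-column keys are preferred; multi-column keys can slow multi-row deletes.
            stmt.execute("ALTER TABLE main.sales.orders "
                       + "ADD CONSTRAINT orders_pk PRIMARY KEY (order_id)");
        }
    }
}
```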
JDBC Driver issues
The Databricks JDBC driver requires a specific JVM option. Without this configuration, you may encounter connection failure errors while starting the dmts64 service and while saving instance settings in dmconfigurets. To prevent these errors, the following option must be present in the dmts64.vmargs file in <CDC Replication instance directory>/conf:

--add-opens java.base/java.nio=ALL-UNNAMED

For example, the complete dmts64.vmargs line would resemble the following:

--add-opens java.base/java.nio=ALL-UNNAMED -Dorg.apache.commons.logging.Log=com.datamirror.ts.util.trace.TraceSimpleLog -Dcom.datamirror.ts.instance=%TSINSTANCE% com.datamirror.ts.bootstrap.Launcher com.datamirror.ts.commandlinetools.script.Startup

Schemas and Catalogs
When creating a CDC Replication Engine for FlexRep, you must specify a schema where CDC can generate its metadata tables. This schema must reside in the default catalog for the CDC user, and the user must have authority to create and drop tables within it. Replicated tables can be located in this schema or in any other schema within the user’s default catalog. However, only tables in the default catalog for the CDC user are eligible for replication. If replication is required from a different catalog, you must create a separate CDC instance with a different user, and that catalog must be set as the default for that user.
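As a hedged sketch of preparing the metadata schema described above, the statements below create a schema in the CDC user's default catalog and grant the privileges needed to create tables in it (tables the user creates are owned by that user, so it can also drop them). The catalog, schema, and principal names are placeholders, and an open JDBC connection is assumed (obtained as in the earlier sketch); verify the privilege names against the Unity Catalog documentation for your workspace.

```java
import java.sql.Connection;
import java.sql.Statement;

// Illustrative sketch only: preparing a metadata schema for the CDC user in the
// user's default catalog. Names are placeholders; check Unity Catalog privileges
// in the Databricks documentation before applying this in your environment.
public class PrepareCdcMetadataSchema {
    static void prepare(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            // Schema in the CDC user's default catalog that will hold the CDC metadata tables.
            stmt.execute("CREATE SCHEMA IF NOT EXISTS main.cdc_metadata");
            // The CDC user must be able to use the catalog and schema and create tables in it;
            // the tables it creates are owned by it, so it can drop them as needed.
            stmt.execute("GRANT USE CATALOG ON CATALOG main TO `cdc_user@example.com`");
            stmt.execute("GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.cdc_metadata "
                       + "TO `cdc_user@example.com`");
        }
    }
}
```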