Considerations for sizing hash tables

When determining the size of hash tables, consider such factors as the number of subscriptions, source application patterns and workloads, and your desired amount of apply parallelism.

The DEPGRAPHHASHSZ configuration parameter identifies how much of the dependency graph storage (identified using the DEPGRAPHMEMORY configuration parameter) is reserved for one or more hash tables for a subscription.

The DEPGRAPHHASHSZ value is specified in 1 KB increments, so the optimal value of 65535 reserves 67,107,840 bytes (65535 x 1024) of the 1000 MB default dependency graph storage for each subscription that you define. This value is optimal because it creates the largest possible hash table and limits the hashing collisions that cause unnecessary dependencies. The optimal setting is adequate for a few subscriptions (probably around six), but it cannot be used as-is in larger deployments. The default value of 2000 is a balance between environment size, storage usage, and avoiding hashing collisions.
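The storage arithmetic can be sketched as follows. This is a minimal illustration, not product code: the function names are hypothetical, and treating 1 MB as 1024 x 1024 bytes is an assumption. The subscription count that it returns is only an upper bound, because it ignores the storage that synonym chains also need, so the practical limit is lower.

```python
KB = 1024
MB = 1024 * 1024  # assumption: DEPGRAPHMEMORY megabytes are binary megabytes


def hash_table_bytes(depgraphhashsz: int) -> int:
    """Bytes reserved for hash tables for one subscription (1 KB per unit)."""
    return depgraphhashsz * KB


def max_subscriptions_upper_bound(depgraphmemory_mb: int, depgraphhashsz: int) -> int:
    """Upper bound on subscriptions if storage held nothing but hash tables.

    Synonym chains share the same storage, so the real limit is smaller.
    """
    return (depgraphmemory_mb * MB) // hash_table_bytes(depgraphhashsz)


if __name__ == "__main__":
    for size in (2000, 65535):
        print(size, hash_table_bytes(size), max_subscriptions_upper_bound(1000, size))
```

Comparing the default and maximum DEPGRAPHHASHSZ values this way shows why the maximum setting suits only a small number of subscriptions.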

The dependency graph consists of hash table slots and synonym chains. Each hash table entry is 16 bytes long, so each DEPGRAPHHASHSZ unit (1 KB) reserves 64 hash table entries for a subscription. Typically, the DEPGRAPHKEYS configuration parameter is set to 2. With this setting, two hash tables are maintained: one that uses the resource name (DSN) as the hash value, and a second that uses the resource name and the root key as the hash value. The reserved entries are divided between the two tables, so each table receives half of them.
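The entry counts follow directly from these figures. The sketch below assumes that the reserved entries are split evenly across the tables; the function name is hypothetical, and generalizing the divisor to any DEPGRAPHKEYS value other than 2 is an assumption.

```python
ENTRY_BYTES = 16                         # size of one hash table entry
ENTRIES_PER_UNIT = 1024 // ENTRY_BYTES   # 64 entries per DEPGRAPHHASHSZ unit


def entries_per_table(depgraphhashsz: int, depgraphkeys: int = 2) -> int:
    """Hash table entries available to each table for one subscription.

    Assumes an even split of the reserved entries across the tables.
    """
    total_entries = depgraphhashsz * ENTRIES_PER_UNIT
    return total_entries // depgraphkeys


if __name__ == "__main__":
    print(entries_per_table(2000))    # default DEPGRAPHHASHSZ
    print(entries_per_table(65535))   # maximum DEPGRAPHHASHSZ
```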

A 32-bit hashing algorithm converts each resource name, and each combination of resource name and root key, into an index into the hash table. Each hash table entry records a synonym chain that identifies every UOR that contains updates that hash to that entry. When multiple UORs hash to the same set of entries, dependencies are created that restrict the amount of parallelism that can be achieved.
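The way that shared hash table entries turn into dependencies can be illustrated with a small model. The product's hashing algorithm is not described here, so CRC-32 from the Python standard library stands in as an arbitrary 32-bit hash. The function names and the sample UOR identifiers are hypothetical.

```python
import zlib
from collections import defaultdict


def slot_for(key: str, num_slots: int) -> int:
    """Map a key to a hash table index (CRC-32 is a stand-in 32-bit hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_slots


def build_synonym_chains(uors: dict, num_slots: int) -> dict:
    """Record, for each slot, the UORs whose updates hash to that slot."""
    chains = defaultdict(list)
    for uor_id, keys in uors.items():
        # A UOR is listed once per slot, even if several of its keys land there.
        for slot in {slot_for(key, num_slots) for key in keys}:
            chains[slot].append(uor_id)
    return chains


def dependent_pairs(chains: dict) -> set:
    """Pairs of UORs that share a slot and therefore cannot run in parallel."""
    pairs = set()
    for uor_ids in chains.values():
        for i, earlier in enumerate(uor_ids):
            for later in uor_ids[i + 1:]:
                pairs.add((earlier, later))
    return pairs


if __name__ == "__main__":
    uors = {"UOR1": ["DSN.A"], "UOR2": ["DSN.B"], "UOR3": ["DSN.A"]}
    print(dependent_pairs(build_synonym_chains(uors, 1 << 16)))
```

UOR1 and UOR3 update the same resource, so they always share an entry. A dependency between UOR1 and UOR2 arises only when their different names happen to hash to the same entry.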

Because most applications tend to update one or more data sets and different keys, the general rule is that the larger the hash table (the more slots), the more source UORs can be applied in parallel. Conversely, the fewer hash table entries there are, the less parallelism you can achieve, because updates to different resources or keys are more likely to hash to the same entry in a smaller hash table “name space.”
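The effect of table size on collisions can be estimated with a standard approximation. If n distinct keys hash uniformly into m slots, the expected number of key pairs that share a slot is n(n - 1) / (2m). Uniform hashing is an assumption here, and the function name and the key count are illustrative only.

```python
def expected_colliding_pairs(distinct_keys: int, num_slots: int) -> float:
    """Expected number of distinct-key pairs that share a slot.

    Assumes that the hash spreads keys uniformly across the slots.
    """
    return distinct_keys * (distinct_keys - 1) / (2 * num_slots)


if __name__ == "__main__":
    keys_in_flight = 10_000  # hypothetical number of distinct keys being tracked
    for slots in (64_000, 2_097_120):
        print(slots, expected_colliding_pairs(keys_in_flight, slots))
```

Under this model, the expected number of unnecessary dependencies falls in inverse proportion to the number of slots.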

Using the recommended DEPGRAPHHASHSZ setting of 65535 creates the largest number of hash table entries that is supported for a single subscription. Setting MAXWRITERTHREADS to the maximum of 255 gives you the best chance of getting the maximum parallelism based on the update patterns of your source applications. The actual levels of parallelism that you see are likely to vary considerably based on variations in the workflow patterns and loads of source applications.

The best way to test these kinds of configurations is by processing historical source changes. Most sites have a desired “catch-up rate,” at which changes are applied to the target at a higher average rate than the source applications generate them. Increasing the size of the hash table is the primary tool that you use to increase target apply rates to meet your catch-up objectives.
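When you evaluate test results against a catch-up objective, the following arithmetic can help. It is a simple sketch under the assumption that both rates are steady averages; the function and parameter names are hypothetical.

```python
from typing import Optional


def catch_up_seconds(backlog_changes: float,
                     apply_rate: float,
                     source_rate: float) -> Optional[float]:
    """Seconds needed to clear a backlog at steady average rates.

    Rates are in changes per second. Returns None if the target does not
    apply changes faster than the source generates them.
    """
    if apply_rate <= source_rate:
        return None
    return backlog_changes / (apply_rate - source_rate)


if __name__ == "__main__":
    # Hypothetical figures: 3.6 million backlogged changes,
    # 1500 changes/s applied, 500 changes/s generated.
    print(catch_up_seconds(3_600_000, 1500, 500))
```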