Record Deduplicator
The Record Deduplicator evaluates records for duplicate data and routes each record to one of two streams: one for unique records and one for duplicate records. Use the Record Deduplicator to discard duplicate data or to route duplicates through different processing logic.
The Record Deduplicator can compare entire records or a subset of fields. Use a subset of fields to focus the comparison on fields of concern. For example, to discard purchases that are accidentally submitted more than once, you might compare information about the purchaser, selected items, and shipping address, but ignore the timestamp of the event.
To enhance pipeline performance, the Record Deduplicator hashes the comparison fields and compares the hashed values to detect duplicates. On rare occasions, a hash function can produce a collision, causing distinct records to be incorrectly treated as duplicates.
Comparison Window
The Record Deduplicator caches record information for comparison until it reaches a specified number of records. Then, it discards the information in the cache and starts over.
You can also configure a time limit that triggers a cache refresh at regular intervals. When a time limit is configured, it takes precedence over the record limit.
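The comparison window can be sketched as a cache keyed by record hash that resets when its limit is reached. This is a hypothetical illustration under the assumptions stated above (a configured time limit takes precedence over the record limit); the class and parameter names are invented for the example.

```python
import time

class ComparisonWindow:
    """Hash cache that refreshes after max_records, or, when a time
    limit is set, after max_seconds instead (time takes precedence)."""

    def __init__(self, max_records, max_seconds=None):
        self.max_records = max_records
        self.max_seconds = max_seconds
        self._reset()

    def _reset(self):
        # Discard all cached record information and start over.
        self.seen = set()
        self.started = time.monotonic()

    def is_duplicate(self, record_hash):
        if self.max_seconds is not None:
            # Time limit configured: refresh only on interval expiry.
            if time.monotonic() - self.started >= self.max_seconds:
                self._reset()
        elif len(self.seen) >= self.max_records:
            # No time limit: refresh once the record limit is reached.
            self._reset()
        duplicate = record_hash in self.seen
        self.seen.add(record_hash)
        return duplicate
```

Note the consequence of the rolling window: once the cache refreshes, a previously seen record is treated as unique again, so the window size bounds how far apart two copies of a record can be and still be detected.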
When you stop the pipeline, the Record Deduplicator discards all information in memory.