Configuration Options in the Duplicate Filtering Section

The following options in the Duplicate Filtering section of the Configuration > Indexing tab affect performance or memory use:

  • Generate shingles - Shingles are hashes of regions of text that can be quickly compared to estimate the probability that two texts are duplicates of one another. Shingles are normally calculated during indexing; set this option to false to disable that calculation. The only reason to disable shingle generation at indexing time is to speed up indexing and save a small amount of space.
    Tip: Duplicate filtering at search time can be disabled in the source associated with a search collection.
  • Contents to shingle - This newline-separated list identifies the content elements whose text is used when calculating shingles to determine whether documents are near-duplicates. Each content element name can optionally be preceded by a + or - symbol. The + symbol specifies that shingles be calculated for that content element, while the - symbol suppresses shingle generation for that content element.
  • Window size - The number of words in the window used by the Duplicate Elimination algorithm. Increasing this value slows indexing, because more words are used for duplicate detection, but usually improves duplicate detection and elimination. The maximum value of this field is 20.
  • n-hashes - The number of hash functions to use in the Duplicate Elimination algorithm. The maximum value of this field is 14.
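
To illustrate how a window size and a number of hash functions can interact, the following Python sketch shows a generic min-hash style of shingling. It is not the product's Duplicate Elimination algorithm, which this section does not describe. The function names (shingles, signature, estimated_similarity), the choice of hash, and the default values are all assumptions made for illustration.

```python
import hashlib


def shingles(text, window_size=4):
    """Return the set of overlapping word windows (shingles) in text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < window_size:
        # Text shorter than the window yields a single shingle.
        return {" ".join(words)}
    return {
        " ".join(words[i:i + window_size])
        for i in range(len(words) - window_size + 1)
    }


def _seeded_hash(shingle, seed):
    """Hypothetical hash family: one 64-bit hash function per seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(text, window_size=4, n_hashes=8):
    """Keep the minimum hash of all shingles for each hash function."""
    shingle_set = shingles(text, window_size)
    if not shingle_set:
        return []
    return [
        min(_seeded_hash(s, seed) for s in shingle_set)
        for seed in range(n_hashes)
    ]


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which two documents agree."""
    if not sig_a or len(sig_a) != len(sig_b):
        return 0.0
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

In this sketch, each added hash function adds one value to every document's signature and one more pass over its shingles, so a higher count costs more time and space in exchange for a finer-grained similarity estimate. Two identical texts always produce identical signatures, so estimated_similarity returns 1.0 for them.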