Duplicate Elimination
Duplicate elimination is conducted in several places in the software. Clustering can be configured to manage duplicates by de-duplicating by key and perfoming a near duplicate similarity check. These options are described in the Clustering Configuration section of the documentation. The search engine will, by default, attempt to remove duplicates in two ways. First, by doing exact duplicate filtering in the crawler. Second, by doing near duplicate elimination in the query-service (by using information built during the indexing).
Search engine near duplicates are found using a probabilistic algorithm. At indexing time a small signature is computed for each document and at search time each potentially retrieved document is checked to see if it is a duplicate of any higher ranked document. That is, if two documents are considered duplicates then the better ranking one will always be returned for any query.
The algorithm works by taking all unique sequences of Indexing Options words in the document (if Indexing Options is 2, the default, then all consecutive pairs of words are used for checking for duplicates). A hash function is used to effectively pick a random word sequence from each document and if they are the same sequence of Indexing Options words then the two documents are considered the same according to this one hash function. This is repeated for Indexing Options hash functions (the default is 14). If two documents agree for each and every one of these hash functions then they are considered to be duplicates according to this one trial. 6 of these trials are repeated and at least 2 trials must consider the documents to be equal for them to be considered duplicates. These 6 trials are the signature that is built at indexing time.
To make the duplicate checking stricter, increase the number of hash functions. If you are getting false duplicates because the data that you are crawling has too small a vocabulary (the pairs of words don't capture enough about the content of the pages), you can increase the number of words per sequence. The latter is rarely needed. To understand what happens as you change the number of hash functions, we can compute the probability of documents being considered duplicates.
If two documents have p% overlap of unique sequences of Indexing Options words, then the probability that two documents are considered duplicates by a single hash function is p. Then, the probability that a single set of hash functions (trial) considers them to be duplicates is the following:
1 - ((1 - p**n-hashes)**6 + 6 * p**n-hashes * (1-p**n-hashes)**5)
This equates to the following table of probabilities for different numbers of hash functions. Documents with a probability of 2 are considered duplicates.
% overlap | 10 hashes | 12 hashes | 14 hashes | 16 hashes | 18 hashes | 20 hashes |
---|---|---|---|---|---|---|
100% | 100% | 100% | 100% | 100% | 100% | 100% |
99% | 99.9% | 99.9% | 99.9% | 99.9% | 99.9% | 99.9% |
98% | 99.9% | 99.7% | 99.6% | 99.3% | 98.8% | 98.2% |
97% | 99.4% | 98.8% | 97.9% | 96.5% | 94.8% | 92.7% |
96% | 98.2% | 96.5% | 94.0% | 90.9% | 87.0% | 82.6% |
95% | 95.8% | 92.4% | 87.9% | 82.4% | 76.2% | 69.7% |
94% | 92.3% | 86.6% | 79.7% | 72.0% | 63.9% | 55.8% |
93% | 87.5% | 79.5% | 70.3% | 60.8% | 51.5% | 42.8% |
92% | 81.6% | 71.3% | 60.4% | 49.8% | 40.0% | 31.7% |
91% | 75.0% | 62.7% | 50.6% | 39.7% | 30.3% | 22.8% |
90% | 67.8% | 54.1% | 41.5% | 30.9% | 22.4% | 15.9% |
85% | 33.7% | 20.5% | 12.0% | 6.8% | 3.7% | 2.0% |
80% | 12.9% | 5.9% | 3.0% | 1.1% | 0.5% | 0.2% |