Duplicate document detection

Duplicate document detection prevents search results from containing multiple documents with the same or nearly the same content.

Search quality can degrade when multiple copies of the same (or nearly the same) document are listed in the search results. Duplicate document analysis cannot occur when collection security is enabled.

During global analysis, the indexing processes detect duplicates by scanning the content of each document. If two documents have the same content, they are treated as duplicates.
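The comparison described above can be illustrated with a minimal sketch. It is not the product's implementation: the normalization in canonical_form, the use of a SHA-256 digest, and the function names are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def canonical_form(text: str) -> str:
    # Hypothetical normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def group_duplicates(docs: dict[str, str]) -> list[list[str]]:
    """Group document IDs whose canonical content is identical."""
    groups = defaultdict(list)
    for doc_id, content in docs.items():
        digest = hashlib.sha256(canonical_form(content).encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    # Only groups with more than one member contain duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]
```

Hashing the canonical form means each document is compared against a digest rather than against every other document, so grouping takes a single pass over the collection.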

When you specify that a field or metadata field constitutes document content, the content of those fields is added to the dynamic summary of the document in the search results, which can affect whether the document is displayed in the search results. If near duplicate detection is enabled in the application (the NearDuplicateDetection property in the setProperty method is set to Yes), documents with similar titles and summaries are suppressed when a user views search results.
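As a rough illustration of how suppressing results with similar titles and summaries might behave, the following sketch filters a result list. The similarity measure (difflib.SequenceMatcher on the combined title and summary), the 0.9 threshold, and the dictionary layout of a result are assumptions for illustration and do not describe how the product computes similarity.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the two strings are identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suppress_near_duplicates(results: list[dict], threshold: float = 0.9) -> list[dict]:
    """Keep a result only if it is not too similar to one already shown."""
    shown = []
    for result in results:
        text = f"{result['title']} {result['summary']}"
        is_near_duplicate = any(
            similarity(text, f"{kept['title']} {kept['summary']}") >= threshold
            for kept in shown
        )
        if not is_near_duplicate:
            shown.append(result)
    return shown
```

Because the comparison uses the summary text, anything that changes the dynamic summary, such as designating an extra field as document content, can change which results are suppressed.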

In a group of duplicate documents, one document is the master and the others are duplicates. All documents in the group have the same canonical representation of the content. During indexing, the content (tokens) of the master document is indexed. For the duplicate documents, only the metadata tokens are indexed. When the master document is deleted from the index, the next duplicate becomes the master. When users search the collection, only the master document is returned.
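The master and duplicate roles can be modeled with a small data structure. This is a sketch under assumptions: the class name DuplicateGroup, its methods, and the choice to treat the first member of an ordered list as the master are hypothetical and only mirror the behavior described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DuplicateGroup:
    # Ordered members; by assumption, the first entry is the master.
    members: list[str] = field(default_factory=list)

    @property
    def master(self) -> Optional[str]:
        return self.members[0] if self.members else None

    def add(self, doc_id: str) -> None:
        if doc_id not in self.members:
            self.members.append(doc_id)

    def delete(self, doc_id: str) -> None:
        # Removing the master promotes the next duplicate automatically.
        self.members.remove(doc_id)

    def indexes_content(self, doc_id: str) -> bool:
        # Only the master has its content tokens indexed;
        # duplicates contribute metadata tokens only.
        return doc_id == self.master
```

For example, in a group created with members ["a", "b", "c"], deleting "a" leaves "b" as the master, and "b" is then the only document whose content tokens are indexed and the only one returned by a search.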

Setting up duplicate document detection
When you create a collection, you can specify whether you want to enable duplicate document detection for the collection. You can also enable or disable duplicate document detection by changing general options for the collection.

If you enable this function for a collection, you can configure schedules to control when the detection process runs. Because duplicate document detection runs only when the index build is paused, you might want to schedule a time for it to run to ensure that it runs only when crawlers are not actively adding content to the index.

Viewing duplicate documents in the content analytics miner
In the content analytics miner, users can specify preferences for viewing information about duplicate documents. For example, users can see what percentage of documents that match the current search conditions are duplicates. They can also select a document and be shown a list of documents that are similar to the selected document. Users can also set a sliding scale preference to control how similar documents must be to the selected document for them to be included in the list of similar documents.
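The two views described above, the percentage of duplicates among matching documents and a similarity-filtered list, can be sketched as follows. The Jaccard similarity on word sets, the threshold parameter that stands in for the sliding scale, and all function names are assumptions for illustration, not the measure the content analytics miner uses.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts, in the range [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similar_documents(selected_id: str, docs: dict[str, str], threshold: float) -> list[str]:
    """List documents at least `threshold` similar to the selected one."""
    selected = docs[selected_id]
    return [
        doc_id
        for doc_id, text in docs.items()
        if doc_id != selected_id and jaccard(selected, text) >= threshold
    ]


def duplicate_percentage(matching_ids: list[str], duplicate_ids: set[str]) -> float:
    """Percentage of the matching documents that are flagged as duplicates."""
    if not matching_ids:
        return 0.0
    flagged = sum(1 for doc_id in matching_ids if doc_id in duplicate_ids)
    return 100.0 * flagged / len(matching_ids)
```

Raising the threshold shortens the list of similar documents, and lowering it admits looser matches, which is the effect a sliding scale preference would have.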