Duplicate document analysis and collection security

If you enable collection security, the global analysis processes do not identify duplicate documents in the collection.

During global analysis, the indexing processes identify documents that are duplicates (or near duplicates) of each other. They then associate all of these documents with one canonical representation of the content. By allowing duplicate documents to be identified, you can ensure that query results do not contain multiple documents with the same (or nearly the same) content.

If you enable collection security when you create a collection, duplicate documents are not identified, and so they are not associated with a common canonical representation. Instead, each document is indexed independently. This ensures that the security controls for each document are evaluated so that users search only the documents with security tokens that match their credentials. Two documents might be nearly identical in content, but use different access control lists to enforce security.

For example, assume that document_A and document_B are duplicates of each other and that a user has access rights only to document_B. If duplicate detection eliminates document_B and keeps document_A as the canonical representation, the user cannot see the content in the query results, because the access constraints that are evaluated are the ones in place for document_A, which the user does not satisfy.
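
The following Python sketch illustrates this scenario. It is a simplified model, not the product's implementation: the Doc class, the build_index and search functions, and the "finance" and "sales" security tokens are all hypothetical names introduced for illustration, and exact content matching stands in for real near-duplicate detection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    content: str
    acl: frozenset  # hypothetical security tokens allowed to read this document


def build_index(docs, detect_duplicates):
    """Return the documents that remain searchable after indexing."""
    if not detect_duplicates:
        # Collection security enabled: each document is indexed independently.
        return list(docs)
    canonical = {}
    for doc in docs:
        # Assumption: the first document seen with a given content becomes
        # the canonical representation; later duplicates are eliminated.
        canonical.setdefault(doc.content, doc)
    return list(canonical.values())


def search(index, user_tokens):
    """Return IDs of indexed documents whose ACL shares a token with the user."""
    return [d.doc_id for d in index if d.acl & user_tokens]


docs = [
    Doc("document_A", "quarterly report", frozenset({"finance"})),
    Doc("document_B", "quarterly report", frozenset({"sales"})),
]
user_tokens = frozenset({"sales"})  # the user can read only document_B

with_dedup = search(build_index(docs, detect_duplicates=True), user_tokens)
without_dedup = search(build_index(docs, detect_duplicates=False), user_tokens)

print(with_dedup)     # [] : document_B was eliminated, document_A is not readable
print(without_dedup)  # ['document_B']
```

In this model, a user who holds both tokens would receive both document_A and document_B when duplicate detection is off, which is the search-quality cost of indexing each document independently.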

Disabling duplicate document analysis can enhance the security of documents in a collection, but search quality might be degraded if users receive multiple copies of the same document in the query results.