Deduplication

Deduplication ensures that only one copy of a document or an embedded attachment is kept in the archive, no matter how many times the same document or attachment was archived by different users.

Examples of duplicate content are identical email documents in multiple mailboxes, identical attachments in otherwise different email documents, or identical files that are stored in multiple locations. To avoid storing redundant data, Content Collector uses these two layers of deduplication:
Content Collector deduplication
You can configure Content Collector to detect duplicates and to add only unique content. Only one document object is created in the repository; new instances of an already archived document do not create further document objects. A single document object represents all instances of a document. As a result, no duplicate objects appear when you browse or search for documents, and no FileNet® P8 workflow is triggered for the duplicates.

Content Collector deduplication depends on the calculation of a unique deduplication hash key. Each source connector type calculates its hash keys differently.

You configure Content Collector deduplication in the archiving task route. The mail archiving templates that are provided with the product are configured to use Content Collector deduplication. You should configure your task routes accordingly to ensure that only unique content is archived.

Repository-based deduplication
Repository-based deduplication for IBM® Content Manager is done on the storage device layer and through IBM Spectrum Protect (formerly Tivoli Storage Manager, TSM).

For FileNet P8, the system uses the native deduplication that is provided by FileNet P8 and device level deduplication. For FileNet P8 deduplication, suppression of duplicate content elements must be enabled on the storage area. With FileNet P8 deduplication, only one copy of the content is physically stored. For new instances of an already archived document, further document objects are created that all point to the same physical copy in the storage area. As a result, when browsing or searching, each duplicate appears to be a unique document. For detailed information about native FileNet P8 deduplication, see the topics about suppressing duplicate content elements in the FileNet P8 documentation.

Note that repository-based deduplication does not work for email that was ingested through Content Collector because each email document is considered a different copy. The instances of an email that was sent to multiple recipients are not binary identical. Therefore, traditional deduplication algorithms do not recognize them as duplicates. Attachments, however, are deduplicated because the binary hash key for the attachment does not change.
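
The difference between hashing a whole email and hashing only its attachment can be illustrated with a short Python sketch. The message layout, the boundary marker, the header values, and the use of SHA-256 are assumptions made for illustration only; they do not describe how a repository or storage device actually computes its hashes.

```python
import hashlib

# Hypothetical separator between the message text and the attachment bytes.
BOUNDARY = b"--ATTACHMENT--\r\n"

def digest(data: bytes) -> str:
    """Return a hex digest of the raw bytes (SHA-256 chosen for illustration)."""
    return hashlib.sha256(data).hexdigest()

def build_copy(recipient: str, attachment: bytes) -> bytes:
    """Build a simplified per-recipient copy; only the delivery header differs."""
    headers = f"Delivered-To: {recipient}\r\nSubject: Q3 report\r\n\r\n".encode()
    body = b"Please see the attached report.\r\n"
    return headers + body + BOUNDARY + attachment

def extract_attachment(message: bytes) -> bytes:
    """Return everything after the boundary marker."""
    return message.split(BOUNDARY, 1)[1]

report = b"quarterly report bytes"
copy_a = build_copy("alice@example.com", report)
copy_b = build_copy("bob@example.com", report)

# The full messages differ in one header, so a binary hash sees two documents.
emails_match = digest(copy_a) == digest(copy_b)
# The attachment bytes are unchanged, so their hashes are equal.
attachments_match = (
    digest(extract_attachment(copy_a)) == digest(extract_attachment(copy_b))
)

print("emails binary identical:", emails_match)
print("attachments binary identical:", attachments_match)
```

In this sketch, a single differing header is enough to make the two copies look unrelated to a byte-level comparison, while the attachment still matches.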

Email and attachment deduplication

If you configure your task route accordingly, both of the described layers of deduplication apply to email archiving. Content Collector deduplication reduces the number of objects that are created in the repository and thus reduces the space that is used for storing multiple copies of the same email. Repository-based deduplication reduces the space that is used for storing attachments when the same file was attached to different email documents.

Content Collector uses a prescriptive email data model for compliance archiving, space management, and duplicate management. With this data model, Content Collector always archives the complete email; you cannot archive only the attachments. In the repository, however, the email document and the attachments are stored separately. In business process management (BPM) task routes, the Content Collector email data model is not used; in these task routes, it is therefore possible to archive only the attachments.

Email deduplication happens only if the copies of the email have the same format. If, for example, the same email is present in SMTP/MIME format and in a native email server format for Domino or Exchange, no deduplication occurs for the email bodies. With deduplication, Content Collector stores only one copy of identical email, whether you work with an IBM Content Manager repository or a FileNet P8 repository.

Attachments are handled differently. Attachments archived in IBM Content Manager in item types created using the compound email data model are subject to hash key based Content Collector deduplication. Only one copy of each unique attachment is stored, regardless of the number of email documents (identical or not) in which the same attachment is used.

Deduplication of attachments archived into FileNet P8 is handled by Content Collector only for identical email. If a file is attached to multiple email documents that are not copies of one another, deduplication is not handled by Content Collector. Deduplication of such attachments is handled by FileNet P8 internally, if native FileNet P8 deduplication is enabled.

If an attachment is also ingested as a document through IBM Connections, Microsoft SharePoint, or File System, no Content Collector deduplication is provided.

For email, the hash key calculation takes document elements or metadata as input data. Content Collector detects duplicate email regardless of whether the email source is journal, sent, or received, and stores only one copy of the document, except for these cases:
  • When blind-carbon-copy (Bcc) recipients are included in the Microsoft Exchange or Lotus® Notes® email document
  • When a Microsoft Exchange email document contains tracking information
  • When Information Rights Management (IRM) is used in Microsoft Exchange. In this case, up to three different copies of the email document are stored in the archive:
    • the sender's unprotected copy (in the Sent Items folder)
    • the recipients' IRM-protected copy
    • the decrypted copy from the journal report (if Microsoft Exchange journal decryption is enabled and Content Collector is configured to archive the decrypted copy)
In these cases, different hash keys are calculated for the sender copy and for the copy for the journal and all recipients including Bcc recipients. Therefore, two copies of the email document are stored, one copy for the journal and all recipients including Bcc recipients and one copy for the sender. This is because the list of Bcc recipients is restored only for the sender. For each recipient, including recipients who were originally on Bcc, the restored email document does not contain a Bcc recipient list. For more information, see the topic about calculating the deduplication hash keys that applies to your mail system.
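
The effect of the Bcc list on the hash key can be sketched in Python. The function `hash_key`, the choice of input fields, and the use of MD5 here are hypothetical; the actual input to the hash key calculation is described in the hash key topics for each mail system and is not reproduced here.

```python
import hashlib

# Hypothetical set of fields that feed the hash key, in a fixed order.
KEY_FIELDS = ("from", "to", "cc", "bcc", "subject", "body")

def hash_key(fields: dict) -> str:
    """Compute an illustrative hash key over selected email fields."""
    parts = []
    for name in KEY_FIELDS:
        value = fields.get(name, "")
        if isinstance(value, (list, tuple)):
            # Sort address lists so that ordering does not change the key.
            value = ",".join(sorted(value))
        parts.append(f"{name}={value}")
    return hashlib.md5("\n".join(parts).encode("utf-8")).hexdigest()

common = {
    "from": "exec@example.com",
    "to": ["alice@example.com", "bob@example.com"],
    "subject": "Reorganization",
    "body": "Details follow.",
}

# Assumption: only the sender copy still carries the Bcc recipient list.
sender_key = hash_key({**common, "bcc": ["carol@example.com"]})
recipient_key = hash_key(common)
journal_key = hash_key(common)

print("sender vs recipient:", sender_key == recipient_key)
print("recipient vs journal:", recipient_key == journal_key)
```

Under these assumptions, the recipient and journal copies share one key and the sender copy gets a second key, which matches the two stored copies described above.
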
Restriction: The following information applies to Lotus Notes only.

Usually, the sender copy of a signed email is not signed. Only the journal copy and the recipient copies are signed. Because only a single copy of the signed email is stored in the repository, this copy might not contain the signature if the sender copy was archived first. To avoid this problem for compliance archiving, use journal archiving. In most cases, journal archiving happens before archiving from user mailboxes, so a signed copy of the email is usually archived first. However, this is not guaranteed in all cases. See the topic about calculating the deduplication hash keys for Lotus mail documents for more information.

Deduplication of file system, IBM Connections, and Microsoft SharePoint documents

If you configure your task routes accordingly, the file system, IBM Connections, and Microsoft SharePoint connectors create a standard MD5 hash that can be used for deduplication. An MD5 hash is produced by the MD5 Message-Digest Algorithm. This algorithm takes the file contents and returns a 128-bit value. You enable hash key generation as follows:
  • For file system documents, in the file system collector
  • For IBM Connections documents, in the CX Pre-processing task
  • For Microsoft SharePoint documents, in the SP Create File task
Content Collector deduplication is configured in the repository connector tasks (CM 8.x Duplicate Detection for IBM Content Manager, and P8 Create Document or P8 Find Duplicate Email for FileNet P8). Content Collector then considers any two documents with the same hash key as identical and thus stores only one of them in the repository.
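
As an illustration of what an MD5 hash over file contents looks like, the following Python sketch hashes three temporary files with the standard `hashlib` module. The helper name `md5_of_file`, the file names, and the file contents are made up for this example; the sketch does not show how the connectors themselves are implemented.

```python
import hashlib
import tempfile
from pathlib import Path

def md5_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the MD5 digest of a file's contents as 32 hex characters."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as root:
    first = Path(root, "share_a", "report.txt")
    second = Path(root, "share_b", "copy_of_report.txt")
    other = Path(root, "share_b", "minutes.txt")
    for path, text in ((first, "Q3 results"), (second, "Q3 results"), (other, "Minutes")):
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")

    key_first = md5_of_file(first)
    key_second = md5_of_file(second)
    key_other = md5_of_file(other)

print("same content, different location:", key_first == key_second)
print("different content:", key_first == key_other)
```

The hash depends only on the bytes in the file, so the name and location of a file do not influence the result.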

If your repository is FileNet P8, you can configure deduplication in Content Collector to prevent duplicate objects from appearing in IBM FileNet Enterprise Manager or IBM FileNet Workplace XT. Note that with Content Collector deduplication no FileNet P8 workflow is triggered for duplicate objects.

For IBM Connections documents, use hash key based deduplication only for items that contain no more than one part, like files. Hash keys for items that consist of several parts are likely to differ even if the content of the items is identical.

For Microsoft SharePoint documents, do not configure deduplication in Content Collector if you collect document versions or Microsoft Office documents, because Microsoft SharePoint changes the metadata that the collector uses to identify identical documents. Trying to deduplicate these types of documents slows your system without achieving the expected result.

If you do not configure Content Collector deduplication, repository-based deduplication applies.

Example

Suppose an executive has sent an email to 30 employees. Because the content of the email is important, all 30 users archive the email using the IBM Content Collector archiving function of their email client.

Without deduplication, the email would be saved in the archive 30 times, once for each user. Deduplication avoids this waste of archiving space. IBM Content Collector checks whether the document belonging to an archiving request already exists in the archive. If it exists, it is not archived again. In this case, only the information that is needed to restore each individual copy of the email is stored in the repository, and IBM Content Collector performs the post-archiving actions as configured for the clients. That is, it creates a document stub for each original document.
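
The scenario can be modeled with a small Python sketch. The class `ToyArchive` and its attributes are hypothetical, and a plain MD5 of the message bytes stands in for the product's hash key calculation; the sketch shows only the idea of storing one copy plus per-user restore information.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class ToyArchive:
    """Toy model: one stored document per hash key, one instance record per request."""
    documents: dict = field(default_factory=dict)  # hash key -> content
    instances: list = field(default_factory=list)  # (user, hash key) records

    def archive(self, user: str, content: bytes) -> bool:
        """Archive content for a user; return True only if a new copy was stored."""
        key = hashlib.md5(content).hexdigest()
        is_new = key not in self.documents
        if is_new:
            self.documents[key] = content
        # Always keep what is needed to restore this user's copy.
        self.instances.append((user, key))
        return is_new

archive = ToyArchive()
email = b"Subject: Strategy update\r\n\r\nPlease read carefully."

# 30 employees archive the same email.
results = [archive.archive(f"user{n:02d}", email) for n in range(1, 31)]

print("stored documents:", len(archive.documents))
print("instance records:", len(archive.instances))
```

In this model, only the first request stores the content; the other 29 requests add an instance record that points to the existing copy.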