Support for crawling compound documents

If a document contains multiple parts, such as attachments or content elements, and you want all parts of the document to be treated as a single document in the search results, you can configure certain crawlers to support compound documents.

The following crawlers provide compound document support:
  • Exchange Server
  • FileNet P8
  • Notes
  • SharePoint

When compound document support is enabled in the crawler configuration, a parent document that contains child documents is searched as a single document. If the search terms are found in the parent or a child, all of the child documents are listed with the parent document in the search results. To help users identify documents of potentially greater interest, the links to child documents that contain the search terms are highlighted in a bold font. If the parent document does not match the search terms, but one or more child documents do, then the document summary is created from the first child document that matches the search terms.
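The result-listing behavior described above can be pictured with a small sketch. It is illustrative only and does not reflect the product's actual implementation or API: the `Part` and `CompoundDocument` classes, the `render_result` function, and the whole-word matching rule are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """Hypothetical stand-in for a parent or child document."""
    title: str
    text: str


@dataclass
class CompoundDocument:
    parent: Part
    children: list = field(default_factory=list)


def matches(part, terms):
    """Assumed matching rule: any search term appears as a whole word."""
    words = part.text.lower().split()
    return any(term.lower() in words for term in terms)


def render_result(doc, terms):
    """Return (summary source title, child links), or None if nothing matches.

    Each child link is a (title, highlighted) pair; highlighted children are
    the ones that would be shown in bold.
    """
    parent_hit = matches(doc.parent, terms)
    child_hits = [matches(child, terms) for child in doc.children]
    if not parent_hit and not any(child_hits):
        return None
    if parent_hit:
        summary_source = doc.parent
    else:
        # Summary comes from the first child that matches the search terms.
        summary_source = doc.children[child_hits.index(True)]
    links = [(child.title, hit) for child, hit in zip(doc.children, child_hits)]
    return summary_source.title, links


doc = CompoundDocument(
    Part("Budget memo", "see attached files"),
    [Part("q1.xls", "revenue forecast"), Part("notes.txt", "meeting agenda")],
)
print(render_result(doc, ["forecast"]))
```

In this sketch the parent does not match, so the summary is taken from `q1.xls`, and every child is still listed with the matching one flagged for highlighting.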

Restriction: With compound documents, all textual information, such as the document body and text in fields or metadata, is treated as if the text were merged into one document. All of the fields from the parent document and child documents are indexed as a single document. One document ID, from the parent document, is assigned to the entire set. If you add, update, or delete a document in the set, the entire set of documents must be indexed again, even if some documents in the set did not change.
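As a rough illustration of this restriction, the following sketch merges a parent and its children into one index entry that carries only the parent's ID. The dictionary layout and the `build_compound_entry` and `reindex` names are hypothetical and chosen for this example; they are not part of the product.

```python
def build_compound_entry(parent, children):
    """Merge the text and fields of all parts into one entry keyed by the parent ID."""
    parts = [parent, *children]
    merged_fields = {}
    for part in parts:
        for name, value in part["fields"].items():
            # Fields from every part end up on the same indexed document.
            merged_fields.setdefault(name, []).append(value)
    merged_text = " ".join(part["body"] for part in parts)
    return {"id": parent["id"], "text": merged_text, "fields": merged_fields}


def reindex(index, parent, children):
    """Any change to one part rebuilds the entry for the entire set."""
    index[parent["id"]] = build_compound_entry(parent, children)


parent = {"id": "mail-1", "body": "status report", "fields": {"subject": "Status"}}
child = {"id": "att-1", "body": "budget table", "fields": {"filename": "budget.xls"}}

index = {}
reindex(index, parent, [child])

# Updating only the attachment still rebuilds the whole compound entry.
child["body"] = "revised budget table"
reindex(index, parent, [child])
print(index["mail-1"]["text"])
```

Because the child's own ID is not kept in the merged entry, there is no way to update one part in place, which is why the whole set is rebuilt.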

If support for compound documents is not enabled in the crawler configuration, the parent and child documents are searched separately and returned as separate documents in the search results. For example, if a document in a Lotus Notes® database contains two attachments, the document becomes three documents in the index, one for the document body and two for the attachments. The three documents are searched as if they are unrelated documents. Thus, a query might return three separate documents even though the three parts comprise a single document.

For another example, a Boolean AND query (this AND that) might not return any of the three documents if the two search terms, "this" and "that", exist in separate attachments. In contrast, when compound document support is enabled, a Boolean AND query searches across the entire set of documents. The compound document is returned in the search results even if one attachment contains "this" and the other attachment contains "that".
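The difference can be demonstrated with a toy index. This is a minimal sketch under simplifying assumptions (whole-word matching on lowercased text, documents held in a plain dictionary); the `and_query` function and the document IDs are invented for illustration.

```python
def and_query(documents, terms):
    """Return the IDs of documents whose text contains every search term."""
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(term.lower() in words for term in terms):
            hits.append(doc_id)
    return hits


# Compound document support disabled: three unrelated index entries.
separate = {
    "note-1": "quarterly summary",
    "note-1/attachment-a": "this appears here",
    "note-1/attachment-b": "that appears there",
}

# Compound document support enabled: one entry under the parent ID.
compound = {"note-1": " ".join(separate.values())}

print(and_query(separate, ["this", "that"]))  # []
print(and_query(compound, ["this", "that"]))  # ['note-1']
```

No single entry in `separate` contains both terms, so the query returns nothing, while the merged entry in `compound` satisfies both.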

Exchange Server crawlers

Microsoft Exchange Server supports attachments to mail, calendars, tasks, and contacts. In some cases, an attachment to mail can have its own attachments. When the Exchange Server crawler is configured to support compound documents, the parent document is published with its attachments as one compound document. The parent and all parts are indexed as one document.

If the Exchange Server crawler is not configured to support compound documents, the parent documents are indexed separately from their attachments. For example, a search might return separate documents for the parent email and each of its attachments.

FileNet P8 crawlers

In FileNet® P8, differences exist between documents that contain multiple content elements and documents that are considered compound documents in FileNet P8.
Documents with multiple content elements
If the FileNet P8 crawler is not configured to support compound documents, a FileNet P8 document that contains multiple content elements gets indexed as multiple documents that have similar metadata, but different binary content. For example, if a FileNet P8 document includes two files, they are indexed as two documents that have almost the same metadata. Some attributes derived from the content elements differ, such as the file name, file size, content type, and so on.

If you configure the FileNet P8 crawler to support compound documents, FileNet P8 documents that contain multiple content elements are treated as a single document in the index. The indexed compound document contains all content elements that are associated with the multiple element document.
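The two indexing outcomes for a document with multiple content elements can be sketched as follows. This is not the crawler's real data model: the dictionary keys, the `#position` ID suffix, and the `index_filenet_document` function are assumptions made for the example.

```python
def index_filenet_document(doc, compound_support):
    """Return the index entries produced for one multi-element document."""
    elements = doc["content_elements"]
    if compound_support:
        # One entry that holds every content element.
        return [{
            "id": doc["id"],
            "metadata": dict(doc["metadata"]),
            "content": [element["data"] for element in elements],
        }]
    entries = []
    for position, element in enumerate(elements):
        # Shared metadata is copied; only content-derived attributes differ.
        metadata = dict(doc["metadata"])
        metadata.update(
            file_name=element["file_name"],
            file_size=len(element["data"]),
            content_type=element["content_type"],
        )
        entries.append({
            "id": f"{doc['id']}#{position}",  # hypothetical ID scheme
            "metadata": metadata,
            "content": [element["data"]],
        })
    return entries


p8_doc = {
    "id": "p8-doc",
    "metadata": {"title": "Contract"},
    "content_elements": [
        {"file_name": "a.pdf", "content_type": "application/pdf", "data": b"abc"},
        {"file_name": "b.txt", "content_type": "text/plain", "data": b"hello"},
    ],
}

print(len(index_filenet_document(p8_doc, compound_support=False)))
print(len(index_filenet_document(p8_doc, compound_support=True)))
```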

Compound documents in FileNet P8
In Watson Explorer Content Analytics, a compound document consists of binary content and limited metadata associated with the binary content. If a parent document and child document have different metadata, Watson Explorer Content Analytics indexes both documents as a single document.

In FileNet P8, a compound document is a document that can contain other documents. The parent and child documents can have completely different metadata, and the documents likely need to be indexed and searched as separate documents.

To support indexing and searching FileNet P8 compound documents, the FileNet P8 crawler treats all parts of the document as separate documents with distinct metadata. They are not indexed as a single document in the Watson Explorer Content Analytics index.

Notes crawlers

When the Notes crawler is configured to support compound documents, the crawler treats attachments as child documents. As a result, one Notes document always becomes the parent document. The parent document is published with its child documents as one compound document. The parent and all parts are indexed as one document.

If the Notes crawler is not configured to support compound documents, the parent documents are indexed separately from their attachments. For example, if a Notes document has two attachments, the crawler produces three documents. A search might return separate documents for the parent document and each attachment.

SharePoint crawlers

SharePoint Server supports attachments to calendars and tasks. When the SharePoint crawler is configured to support compound documents, the parent document is published with its attachments as one compound document. The parent and all parts are indexed as one document.

If the SharePoint crawler is not configured to support compound documents, attachments are indexed separately from the parent document, but they include much of the same metadata as the parent document. For example, if a SharePoint calendar has two attachments, the content is indexed as three separate documents that have nearly the same metadata. A few fields, such as the file name, file size, and last modification date, are different for each document.