Support for crawling compound documents
If a document contains multiple parts, such as attachments or content elements, and you want all parts of the document to be treated as a single document in the search results, you can configure the following crawlers to support compound documents:
- Exchange Server
- FileNet P8
- Notes
- SharePoint
When compound document support is enabled in the crawler configuration, a parent document that contains child documents is searched as a single document. If the search terms are found in the parent or a child, all of the child documents are listed with the parent document in the search results. To help users identify documents of potentially greater interest, the links to child documents that contain the search terms are highlighted in a bold font. If the parent document does not match the search terms, but one or more child documents do, then the document summary is created from the first child document that matches the search terms.
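The result behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not the product's implementation: the `Part` and `CompoundDocument` classes, the `render_result` function, and the naive word matching are all hypothetical. The source states only that the summary comes from the first matching child when the parent does not match; using the parent's text when the parent does match is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Part:
    name: str
    text: str


@dataclass
class CompoundDocument:
    parent: Part
    children: list = field(default_factory=list)


def matches(part: Part, term: str) -> bool:
    # Naive case-insensitive word match; a real search engine tokenizes and ranks.
    return term.lower() in part.text.lower().split()


def render_result(doc: CompoundDocument, term: str) -> Optional[dict]:
    """Build a hypothetical search-result entry, or return None if nothing matches."""
    parent_hit = matches(doc.parent, term)
    matching_children = [c for c in doc.children if matches(c, term)]
    if not parent_hit and not matching_children:
        return None
    # Assumption: the parent supplies the summary when it matches;
    # otherwise the first matching child does (as the text describes).
    summary_source = doc.parent if parent_hit else matching_children[0]
    return {
        "title": doc.parent.name,
        "summary": summary_source.text,
        # Every child is listed; the flag marks links that would be shown in bold.
        "children": [(c.name, matches(c, term)) for c in doc.children],
    }


doc = CompoundDocument(
    parent=Part("mail", "please review the attached files"),
    children=[
        Part("budget.xls", "travel budget forecast"),
        Part("notes.txt", "meeting agenda"),
    ],
)
result = render_result(doc, "budget")
```

In this sketch, a search for `budget` returns one entry whose summary is taken from `budget.xls`, and both children are listed with only `budget.xls` flagged for bold display.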
If support for compound documents is not enabled in the crawler configuration, the parent and child documents are searched separately and returned as separate documents in the search results. For example, if a document in a Lotus Notes® database contains two attachments, the document becomes three documents in the index: one for the document body and two for the attachments. The three documents are searched as if they were unrelated. Thus, a query might return three separate documents even though the three parts make up a single document.
As another example, a Boolean AND query might not return any of the three documents if the two search terms (this AND that) exist in separate attachments. In contrast, when compound document support is enabled, a Boolean AND query searches across all parts of the compound document. The compound document is returned in the search results even if one attachment contains this and the other attachment contains that.
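The difference in Boolean AND behavior can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not how the product evaluates queries: the function names, the `parts` dictionary, and the whitespace tokenization are hypothetical.

```python
def tokens(text: str) -> set:
    # Naive whitespace tokenization, for illustration only.
    return set(text.lower().split())


def and_query_separate(parts: dict, terms: list) -> list:
    """Compound support disabled: every term must occur in the same part."""
    return [
        name
        for name, text in parts.items()
        if all(t.lower() in tokens(text) for t in terms)
    ]


def and_query_compound(parts: dict, terms: list) -> bool:
    """Compound support enabled: terms may be spread across the parts."""
    combined = set().union(*(tokens(text) for text in parts.values()))
    return all(t.lower() in combined for t in terms)


parts = {
    "body": "see the attached files",
    "attachment1": "this appears only here",
    "attachment2": "that appears only here",
}

separate_hits = and_query_separate(parts, ["this", "that"])  # no single part has both terms
compound_hit = and_query_compound(parts, ["this", "that"])   # the combined parts have both
```

Here `separate_hits` is an empty list because no individual part contains both terms, while `compound_hit` is `True` because the terms are found once the parts are treated as one document.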
Exchange Server crawlers
Microsoft Exchange Server supports attachments to mail, calendars, tasks, and contacts. In some cases, an attachment to mail can have its own attachments. When the Exchange Server crawler is configured to support compound documents, the parent document is published with its attachments as one compound document. The parent and all parts are indexed as one document.
If the Exchange Server crawler is not configured to support compound documents, the parent documents are indexed separately from their attachments. For example, a search might return separate documents for the parent email and each of its attachments.
FileNet P8 crawlers
- Documents with multiple content elements
- If the FileNet P8 crawler is not configured to support compound documents, a FileNet P8 document that contains multiple content elements gets indexed as multiple documents that have similar metadata, but different binary content. For example, if a FileNet P8 document includes two files, they are indexed as two documents that have almost the same metadata. Some attributes derived from the content elements differ, such as the file name, file size, content type, and so on.
If you configure the FileNet P8 crawler to support compound documents, FileNet P8 documents that contain multiple content elements are treated as a single document in the index. The indexed compound document contains all content elements that are associated with the multiple element document.
- Compound documents in FileNet P8
- In Watson Explorer Content Analytics, a compound document consists of binary content and limited metadata associated with the binary content. If a parent document and child document have different metadata, Watson Explorer Content Analytics indexes both documents as a single document.
In FileNet P8, a compound document is a document that can contain other documents. The parent and child documents can have completely different metadata, and the documents likely need to be indexed and searched as separate documents.
To support indexing and searching FileNet P8 compound documents, the FileNet P8 crawler treats all parts of the document as separate documents with distinct metadata. They are not indexed as a single document in the Watson Explorer Content Analytics index.
Notes crawlers
When the Notes crawler is configured to support compound documents, the crawler treats attachments as child documents. As a result, one Notes document always becomes the parent document. The parent document is published with its child documents as one compound document. The parent and all parts are indexed as one document.
If the Notes crawler is not configured to support compound documents, the parent documents are indexed separately from their attachments. For example, if a Notes document has two attachments, the crawler produces three documents. A search might return separate documents for the parent document and each attachment.
SharePoint crawlers
SharePoint Server supports attachments to calendars and tasks. When the SharePoint crawler is configured to support compound documents, the parent document is published with its attachments as one compound document. The parent and all parts are indexed as one document.
If the SharePoint crawler is not configured to support compound documents, attachments are indexed separately from the parent document, but they include much of the same metadata as the parent document. For example, if a SharePoint calendar has two attachments, the content is indexed as three separate documents that have nearly the same metadata. A few fields, such as the file name, file size, and last modification date, are different for each document.
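To make the calendar example concrete, the following Python sketch models how one item with two attachments might expand into index entries. It is a hypothetical illustration only: the `index_documents` function, the field names, and the sample values are assumptions and do not reflect the crawler's actual output format.

```python
def index_documents(parent: dict, attachments: list, compound: bool) -> list:
    """Return the index entries produced for one item and its attachments."""
    if compound:
        # One entry that covers the parent and every attachment.
        names = [parent["file_name"]] + [a["file_name"] for a in attachments]
        return [{**parent, "parts": names}]
    # Separate entries: each attachment inherits the parent's metadata
    # and overrides only the per-file fields.
    return [parent] + [{**parent, **a} for a in attachments]


# Hypothetical sample data.
calendar_item = {
    "list": "Team Calendar",
    "author": "jdoe",
    "file_name": "kickoff-meeting",
    "file_size": 512,
    "modified": "2015-03-01",
}
attachments = [
    {"file_name": "agenda.doc", "file_size": 20480, "modified": "2015-02-27"},
    {"file_name": "slides.ppt", "file_size": 409600, "modified": "2015-02-28"},
]

separate = index_documents(calendar_item, attachments, compound=False)
single = index_documents(calendar_item, attachments, compound=True)
```

With compound support off, `separate` holds three entries that share the `list` and `author` values but differ in file name, size, and modification date. With it on, `single` holds one entry that lists all three parts.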