Troubleshooting
Problem
If an email has an attachment that is itself an email in SMTP format and that attached email contains embedded attachments, these embedded attachments are not indexed correctly in Content Manager and therefore cannot be searched.
Symptom
If a Lotus Notes, a Microsoft Exchange, or an SMTP email contains attachments, these attachments are sent to the Oracle OutsideIn Technology filters for text extraction during the indexing phase which is performed by the IBM Content Collector indexer for text search.
Although you can find the original email by searching for metadata (for example, the sender or recipients) or for email body content, you cannot search the content of the embedded SMTP email attachments. If you search for keywords that occur only in these attachments, the original email document does not appear in the result list because the embedded SMTP email attachments were not indexed correctly.
Also, because the sequences of Base64 or UUencode data are not decoded, this data is added to the full text index as is. This significantly increases the index size, which has a negative impact on performance.
Cause
The cause is a bug in the Oracle OutsideIn Technology filters versions 8.3.x and 8.4.x that are shipped with IBM Content Manager V8.4 and V8.5. These filters are used by both the IBM Content Collector indexer for text search and the IBM Text Search User Exit. Because of the bug, the attachments are not recognized as email in SMTP format but as simple plain text files, and hence are not processed correctly.
In addition, if the email contains embedded attachments encoded in Base64 or UUencode, this content is not decoded and indexed as a separate attachment but instead is indexed as Base64 or UUencode data sequences. This leads to a significant growth in the size of the full text index dictionary with a large number of arbitrary and extraneous data tokens which has a negative impact on the performance of the full text index. Furthermore, this index growth can lead to index size problems as the Net Search Extender full text index size is limited. For details see http://www-01.ibm.com/support/docview.wss?uid=swg21617071
These email attachments in SMTP format typically have the file name extensions .EML or .MHT.
Environment
This problem affects indexing on all platforms when using the following indexing components:
- IBM CommonStore Text Search User Exit
- IBM Fast Indexer
- IBM Content Collector Text Search Support
The following users are not affected by this issue:
- Users who do not index attachment data because, for example, the IBM Content Collector indexer for text search configuration option TxtCnvEnabled is set to 0.
- Users who indexed documents using an older version of the Oracle OutsideIn Technology filters, such as V8.2.x or earlier versions which were distributed in older versions of IBM Content Manager (V8.4.0).
- Users who are indexing documents using IBM Content Manager V8.4.3 Fix Pack 4 or V8.5 Fix Pack 2 or a newer version of the Oracle OutsideIn Technology filters.
Note that this problem does not affect the processing of genuine SMTP email by the IBM Content Collector indexer for text search but only SMTP email that is stored as an attachment to the original Lotus Notes, Microsoft Exchange, or SMTP email.
Resolving The Problem
1. You must install the latest IBM Content Manager fix packs which include a fix from Oracle for the OutsideIn Technology filters.
- If you are using IBM Content Manager V8.4, you should install IBM Content Manager Fix Pack 4 for Content Manager V8.4.3 as soon as it is available. This fix pack includes a fix from Oracle for the OutsideIn Technology filters with version number V8.4.1.
This IBM Content Manager fix pack should be available in August 2014. If this issue is critical to you, you can request an interim fix for APAR IO21036 based on Fix Pack 3 for IBM Content Manager V8.4.3 from IBM Content Manager support by opening a PMR.
Note that after you have installed this IBM Content Manager fix pack you must reinstall IBM Content Collector Text Search Support V4.0 Fix Pack 2 because older versions of the IBM Content Collector Text Search Support component are not compatible with the new version 8.4.1 of the Oracle OutsideIn Technology filters.
If the IBM Content Manager Fix Pack is not available or you have not installed it yet, you should nevertheless install the V4.0 Fix Pack 2 version of the IBM Content Collector Text Search Support component to benefit from the new text analysis processing feature for pure plain text attachments. This way affected documents will be recognized and tagged accordingly (IDXRC 5 = WARN_BINARY).
- If you are using IBM Content Manager V8.5, you should install IBM Content Manager Fix Pack 2 for Content Manager V8.5 as soon as it is available. This fix pack includes a fix from Oracle for the OutsideIn Technology filters version 8.4.1.
This IBM Content Manager fix pack should be available in November 2014. If this issue is critical to you, you can request an interim fix for APAR IO21015 based on Fix Pack 1 for IBM Content Manager V8.5 from IBM Content Manager support by opening a PMR.
If you installed the IBM Content Collector V4.0 Fix Pack 2 Text Search Support component, you do not have to install another version of the IBM Content Collector Text Search Support component.
Note that if you installed a version of the IBM Content Collector Text Search Support component that is older than V4.0, you must reinstall the IBM Content Collector Text Search Support V4.0 Fix Pack 2 component because the older versions of IBM Content Collector Text Search Support are not compatible with the new version 8.4.1 of the Oracle OutsideIn Technology filters.
If the IBM Content Manager fix pack is not available or you have not installed it yet, you should nevertheless install the V4.0 Fix Pack 2 of the IBM Content Collector Text Search Support component to benefit from the new text analysis processing feature for pure plain text attachments. This way affected documents will be recognized and tagged accordingly (IDXRC 5 = WARN_BINARY).
2. Ensure that you have installed V4.0 Fix Pack 2 of the IBM Content Collector Text Search Support component. This new version of the IBM Content Collector indexer for text search contains the following features that address the issues described above:
- To prevent binary attachment content from being inadvertently recognized as 7-bit ASCII code during document processing, the document processing algorithms were enhanced by heuristics that ensure that no binary content is indexed.
The indexer processing algorithms check whether the content returned by the Oracle OutsideIn Technology filters is actually plain text, or whether the returned content contains patterns that indicate binary content, such as control characters (code points < 0x1F) or embedded data in UUencode or Base64 encoding. If embedded binary data is recognized, the attachment is not added to the generated XML input and is not indexed. Instead, it is assigned an IDXRC value of 5, indicating binary content, in the table of completed tasks.
This also means that attachment documents which are not recognized correctly by the Oracle OutsideIn Technology filters are tagged with an IDXRC value of 5. To some extent, this also protects against future failures of the filter technology.
- The command-line argument -reindexsearchfile <.ini file> was added to the afuIndexer command-line tool. It enables reindexing of all documents from a previous indexing run in which the search terms specified in the INI format file were indexed with the documents. When you run the -reindexsearchfile operation, multiple search terms can be specified at once for identifying all documents that need to be reindexed.
This new command-line argument can be used to identify documents that were affected by the symptoms described above and to reindex them during a second processing run.
- Create a file named smtp_attachments.ini with the following content using a text editor of your choice:
[reindexsearch]
eml = SECTIONS ("attachment") (".eml")
mht = SECTIONS ("attachment") (".mht")
mime = SECTIONS ("attachment") (".mime")
; add more file type extensions here if necessary
base64 = SECTIONS ("attachment") ("base64")
uuencode = SECTIONS ("attachment") ("begin") & ("end")
binary = "InsoErrorNoFilterAvailable"
- Save the file.
- Run the indexer with the new arguments, for example:
afuIndexer ... -reindexsearchfile smtp_attachments.ini -reindexrc 5 ...
This operation searches all documents using the search patterns contained in the smtp_attachments.ini file and all documents for which the IDXRC was 5, and will add them to the indexer's table of open tasks for reprocessing and reindexing.
The set of documents that are reindexed includes those in which specific terms occur that indicate an attachment of type SMTP and documents with binary content (IDXRC=5).
Depending on the batch size, all of the documents in the indexer's open task table will be reindexed during this or one of the subsequent indexer runs, and the content of the embedded attachments in the SMTP attachment documents will be added to the full text index.
Note that the search queries specified above will deliver a document set that is potentially larger than the set of documents actually affected. This means that more documents will be reprocessed by the indexer than were actually affected by this problem.
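If further attachment extensions have to be covered, the search file can also be generated rather than edited by hand. The following Python sketch is only an illustration: the function names are hypothetical, and it assumes that additional extension entries follow the same pattern as the eml, mht, and mime entries shown above.

```python
# Hypothetical helper names; the entry syntax is copied from the
# smtp_attachments.ini example in this document.
DEFAULT_EXTENSIONS = ("eml", "mht", "mime")


def build_reindex_search_ini(extensions=DEFAULT_EXTENSIONS):
    """Return the text of a [reindexsearch] file for the given extensions."""
    lines = ["[reindexsearch]"]
    for ext in extensions:
        # Normalize ".EML" or "EML" to "eml".
        ext = ext.lower().lstrip(".")
        lines.append(f'{ext} = SECTIONS ("attachment") (".{ext}")')
    # Fixed entries taken unchanged from the example in this document.
    lines.append('base64 = SECTIONS ("attachment") ("base64")')
    lines.append('uuencode = SECTIONS ("attachment") ("begin") & ("end")')
    lines.append('binary = "InsoErrorNoFilterAvailable"')
    return "\n".join(lines) + "\n"


def write_reindex_search_ini(path="smtp_attachments.ini",
                             extensions=DEFAULT_EXTENSIONS):
    """Write the generated search file to the given path."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(build_reindex_search_ini(extensions))
```

Calling write_reindex_search_ini() with no arguments produces the same entries as the file listed above. Every added extension widens the set of documents that are selected for reindexing.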
After the affected documents have been identified and reindexed, the full text index includes the textual content of the embedded attachments in the SMTP email attachments. This allows for a correct and complete search on all email documents and their attachments.
Note that the set of extraneous tokens that might have been added to the index during the erroneous processing of Base64 or UUencode data will not be removed from the full text index dictionary because Net Search Extender does not remove existing dictionary keys.
Hence, the size of the Net Search Extender full text dictionary will not be reduced by this reindexing operation.
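To make the binary-content check described in step 2 more concrete, the following Python sketch shows one way such a heuristic could look. It is not the algorithm used by the IBM Content Collector indexer: the function name, the regular expressions, and the thresholds are assumptions chosen for illustration only. Tab, line feed, and carriage return are deliberately not counted as control characters here.

```python
import re

# All patterns and thresholds below are illustrative assumptions,
# not values taken from the product.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
BASE64_LINE = re.compile(r"^[A-Za-z0-9+/]{60,76}={0,2}$")
UU_BEGIN = re.compile(r"^begin [0-7]{3,4} \S+")


def looks_binary(text, max_control_ratio=0.01, min_base64_lines=3):
    """Return True if extracted text shows signs of binary content."""
    if not text:
        return False

    # 1. A high share of control characters suggests raw binary data.
    control_count = len(CONTROL_CHARS.findall(text))
    if control_count / len(text) > max_control_ratio:
        return True

    base64_run = 0
    saw_uu_begin = False
    for line in text.splitlines():
        stripped = line.strip()

        # 2. Several consecutive long Base64-alphabet lines.
        if BASE64_LINE.match(stripped):
            base64_run += 1
            if base64_run >= min_base64_lines:
                return True
        else:
            base64_run = 0

        # 3. A UUencode "begin <mode> <name>" header followed by "end".
        if UU_BEGIN.match(stripped):
            saw_uu_begin = True
        elif saw_uu_begin and stripped == "end":
            return True

    return False
```

In a pipeline modeled on the description above, an attachment for which looks_binary returns True would be left out of the index input and recorded with IDXRC 5 (WARN_BINARY) instead.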
Document Information
Modified date:
17 June 2018
UID
swg21677327