Document formats and supported code pages

Net Search Extender needs to know the format (or type) of text documents that you intend to search.

This information is necessary for indexing text documents.

Net Search Extender supports the following document formats:

TEXT: Plain text (for example, flat ASCII); in general, any text without markup
HTML: Hypertext Markup Language
XML: Extensible Markup Language
Document format XML is the default for column data type XML, and is the only supported document format for that data type.
GPP: General Purpose Format (flat text with user-defined tags)
Outside In (INSO): Use this format if you use filtering software to extract textual content from PDF files and from documents created with common word-processing tools, for example, Microsoft Word.

For the document formats HTML, XML, GPP, and the Outside In filter formats, searching can be restricted to specific parts of a document.

Where Outside In filters cannot be used because the format of your document is not supported, you can write a user-defined function (UDF) that does its own filtering. This UDF must be specified at index creation time, and it converts the data from the unsupported format to a supported format.

You can index documents if they are encoded in one of the supported code pages, which are identified by Coded Character Set Identifiers (CCSIDs). See the Db2® documentation for a list of these code pages.

To check the database code page, use the following Db2 command:

db2 GET DB CFG for dbname

In the command output, look for the value shown for Database code page.
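If the check needs to be scripted, the value can be picked out of the command output. The following is a minimal Python sketch, not part of Net Search Extender. The function name parse_database_code_page and the sample text are hypothetical, and the sample is an illustrative excerpt that assumes the output contains a line of the form "Database code page = <number>".

```python
import re

# Illustrative excerpt only (an assumption): real GET DB CFG output
# contains many more lines, and the exact spacing may differ.
SAMPLE_CFG_OUTPUT = """\
 Database territory                                      = US
 Database code page                                      = 1208
 Database code set                                       = UTF-8
"""


def parse_database_code_page(cfg_text):
    """Return the database code page as an int, or None if it is not found."""
    for line in cfg_text.splitlines():
        # Match the "Database code page" line and capture the numeric value.
        match = re.match(r"\s*Database code page\s*=\s*(\d+)\s*$", line)
        if match:
            return int(match.group(1))
    return None


print(parse_database_code_page(SAMPLE_CFG_OUTPUT))
```

The input text could come from capturing the output of the Db2 command above, for example with subprocess.run in an environment where the db2 command is available.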

For consistency, Db2 normally converts the code page of a document to the code page of the database. However, when you store data in a Db2 database in a column with a binary data type, such as BLOB or FOR BIT DATA, Db2 does not convert the data, and the documents retain their original CCSIDs.
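The conversion rule above can be expressed as a small decision function. This is a sketch for reasoning about which CCSID a stored document ends up in, not a Db2 or Net Search Extender API. The names is_binary_column and stored_ccsid are hypothetical, and the binary check covers only the two examples named above (BLOB and FOR BIT DATA).

```python
def is_binary_column(column_type):
    """Return True for the binary column types named in the text.

    Assumption: only BLOB and FOR BIT DATA are checked here; other
    binary data types would need to be added.
    """
    normalized = " ".join(column_type.upper().split())
    return normalized.startswith("BLOB") or normalized.endswith("FOR BIT DATA")


def stored_ccsid(column_type, document_ccsid, database_code_page):
    """Return the CCSID that the stored document is expected to have."""
    if is_binary_column(column_type):
        # Binary columns are not converted; the document keeps its own CCSID.
        return document_ccsid
    # Other columns are converted to the database code page.
    return database_code_page


# A document in code page 1252 stored in a database with code page 1208:
print(stored_ccsid("BLOB(1M)", 1252, 1208))
print(stored_ccsid("CLOB(1M)", 1252, 1208))
```

In the first call the document keeps its original CCSID because the column is binary; in the second call it takes the database code page.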

Note that incompatible code pages might cause problems when creating a text index or searching.