Document format detection

The system uses a default mapping of URL extensions and MIME types to determine the type of each document and the parser type to use with it.

The parser_config.xml file and the mimetypes.xml file define rules that map file extensions and MIME types to parser types. By editing these files, you can override and extend the default mapping. For example, you can map a file extension such as .content and specify that documents with that extension are to be parsed by the HTML parser.

Different document formats have different internal representations. The Watson Content Analytics system uses internal and third-party filters for parsing documents, and many documents are parsed with parser services that are specialized for a specific format.

Document format detection and parser assignment occur in the following way:

The algorithm for detecting the document format checks the extension of the URL of the processed document. The extension of the file name, which is part of the metadata that is set by the crawler, is also considered when detecting the document format.
The system checks the MIME type of the document, which is part of the metadata that is set by the crawler.
The system tries to assign the appropriate parser type to each document. For HTML, text (TXT), and XML documents, the system assigns a parser type that is specific to each document format.
For some other document formats, the system uses the text extractor. The text extractor document filtering technology is based on Oracle Outside In Content Access technology. This technology was formerly named Stellent, and the names of some configuration files include the term Stellent.

The text extractor supports several hundred document formats, but only a subset of the document filters are enabled in Watson Content Analytics. You can edit configuration files, however, to allow other document types to be parsed by the text extractor.

Important: Document filters that you add beyond the subset that is enabled in the default system configuration have not been tested and are not supported.

If the system cannot identify the format of a document, the document is rejected. You might see an error message that states that the document type is not supported.

To determine the document type and parser type, the system performs the following steps:
1. Compares the file name to the rules in the mimetypes.xml file. If the file name is not specified, the system compares the URL extension (the extension of the document ID) to the rules.
2. Compares the MIME type to the rules in the mimetypes.xml file to get the normalized type.
3. Compares the document type (normalized MIME type) to the rules in the parser_config.xml file.
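
The three steps above can be sketched in Python as a chain of lookups. This is only an illustration, not the product's implementation: the rule tables are hypothetical stand-ins for the contents of mimetypes.xml and parser_config.xml, and the names detect_parser and UnsupportedDocumentType are invented for this example. The source does not state which rule wins when the extension and the MIME type disagree, so the sketch assumes that an extension match takes precedence and that the MIME type is the fallback.

```python
import os
from urllib.parse import urlparse

# Hypothetical stand-ins for rules in mimetypes.xml (not the real file contents).
EXTENSION_RULES = {          # extension -> normalized document type
    ".html": "text/html",
    ".txt": "text/plain",
    ".xml": "text/xml",
    ".content": "text/html",  # custom extension mapped to HTML, as in the example above
    ".pdf": "application/pdf",
}
MIME_RULES = {               # crawler-reported MIME type -> normalized document type
    "text/html": "text/html",
    "application/xhtml+xml": "text/html",
    "text/plain": "text/plain",
    "application/xml": "text/xml",
    "application/pdf": "application/pdf",
}

# Hypothetical stand-in for rules in parser_config.xml.
PARSER_RULES = {             # normalized document type -> parser type
    "text/html": "html",
    "text/plain": "text",
    "text/xml": "xml",
    "application/pdf": "stellent",  # handled by the text extractor
}


class UnsupportedDocumentType(Exception):
    """Raised when no rule identifies the document format."""


def detect_parser(url, file_name=None, mime_type=None):
    """Return (document_type, parser_type) for a crawled document."""
    # Step 1: use the file name if the crawler set one, else the URL path.
    source = file_name if file_name else urlparse(url).path
    extension = os.path.splitext(source)[1].lower()
    doc_type = EXTENSION_RULES.get(extension)

    # Step 2: normalize the MIME type (assumed here to be a fallback only).
    if doc_type is None and mime_type:
        bare_mime = mime_type.split(";")[0].strip().lower()
        doc_type = MIME_RULES.get(bare_mime)

    # Step 3: map the normalized type to a parser type.
    parser_type = PARSER_RULES.get(doc_type)
    if parser_type is None:
        raise UnsupportedDocumentType(f"document type is not supported: {url}")
    return doc_type, parser_type
```

With these illustrative tables, a call such as detect_parser("http://example.com/a/page.content") resolves to the HTML parser, and a URL with no matching extension and no recognized MIME type is rejected, which mirrors the unsupported document type error described earlier.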

When the parser type is stellent, you might see an error message if the text extractor cannot recognize the document format. The error can occur if:

The document is corrupted.
The document is not in a format that the text extractor supports. To solve this problem, add the rejected file formats to the stellentTypes.cfg file. Also update the mimetypes.xml file or the parser_config.xml file to specify that the MIME type or extension of each rejected document format is to be associated with the text extractor.