The system uses a default mapping of URL extensions and MIME types
to determine the type of each document and the parser type to use with it.
By editing the parser_config.xml file and the mimetypes.xml file,
you can override and extend this default mapping. These files define rules
that map file extensions or MIME types to parser types. For example, you can
map a file extension such as .content and specify
that documents of that type are to be parsed by the HTML parser.
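One way to picture the override-and-extend behavior is as custom rules layered over the defaults. The following Python sketch is illustrative only. The dictionaries, the parser-type labels (html, text, xml), and the effective_rules function are assumptions made for this example. They do not reproduce the actual schema or contents of the parser_config.xml or mimetypes.xml files.

```python
# Hypothetical default rules: file extension -> parser type.
# The real defaults are part of the product configuration, not code like this.
DEFAULT_RULES = {
    ".html": "html",
    ".txt": "text",
    ".xml": "xml",
}

# Hypothetical custom rules, standing in for edits to the configuration files.
CUSTOM_RULES = {
    ".content": "html",  # extend: a new extension that the HTML parser handles
    ".txt": "html",      # override: replace a default assignment (illustrative)
}


def effective_rules(defaults, custom):
    """Return a new mapping in which custom rules take priority over defaults."""
    merged = dict(defaults)   # copy so that the defaults stay unchanged
    merged.update(custom)     # custom entries add to or replace default entries
    return merged


rules = effective_rules(DEFAULT_RULES, CUSTOM_RULES)
print(rules[".content"])
```

In this sketch, a custom rule for an extension that has no default entry extends the mapping, and a custom rule for an extension that already has a default entry overrides it.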
Different document formats have different internal representations.
The Watson Content Analytics system uses
internal and third-party filters for parsing documents, and many documents
are parsed with parser services that are specialized for a specific
format.
Document format detection and parser assignment occur in the following
way:
- The algorithm for detecting the document format checks the extension
of the URL of the processed document. The extension of the file name,
which is part of the metadata that is set by the crawler, is also
considered when detecting the document format.
- The system checks the MIME type of the document, which is part
of the metadata that is set by the crawler.
- The system tries to assign the appropriate parser type to each
document. For HTML, text (TXT), and XML documents, the system assigns
a parser type that is specific to each document format.
For some other document formats, the system uses the text extractor. The
text extractor document filtering technology is based on Oracle Outside
In Content Access technology. This technology was formerly named Stellent,
and the names of some configuration files include the term Stellent.
The text extractor supports several hundred document formats, but only
a subset of the document filters is enabled in Watson Content Analytics. However, you can
edit configuration files to allow other document types to be parsed by the
text extractor.
Important: Document filters that you add, and that are not part of
the subset that is enabled in the default system configuration, have
not been tested and are not supported.
If the system cannot identify
the document format of a document, the document is rejected. You might
see an error message that states that the document type is not supported.
To determine the document type and parser type, the system performs the
following steps:
- Compares the file name to the rules in the mimetypes.xml file.
If the file name is not specified, the system compares the URL extension
(the extension of the document ID) to the rules.
- Compares the MIME type to the rules in the mimetypes.xml file
to get the normalized type.
- Compares the document type (normalized MIME type) to the rules
in the parser_config.xml file.
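The three steps can be sketched in Python as follows. This is a minimal illustration, not the product's implementation. The rule tables, the type and parser labels, and the resolve_parser function are hypothetical, and the real rules are defined in the mimetypes.xml and parser_config.xml files. The sketch also assumes that the MIME type is consulted only when the extension lookup finds no match, which is an order of precedence that this topic does not state.

```python
import os
from urllib.parse import urlparse

# Hypothetical rule tables standing in for mimetypes.xml and parser_config.xml.
EXTENSION_TO_TYPE = {
    ".html": "text/html",
    ".content": "text/html",
    ".txt": "text/plain",
}
MIME_TO_TYPE = {
    "text/html": "text/html",
    "application/xhtml+xml": "text/html",
}
TYPE_TO_PARSER = {
    "text/html": "html",
    "text/plain": "text",
    "application/pdf": "stellent",
}


def resolve_parser(url, file_name=None, mime_type=None):
    """Return a parser type, or None if the document would be rejected."""
    # Step 1: use the file name if the crawler set one, otherwise the URL path.
    name = file_name if file_name else urlparse(url).path
    extension = os.path.splitext(name)[1].lower()
    doc_type = EXTENSION_TO_TYPE.get(extension)

    # Step 2: normalize the MIME type that the crawler set.
    # Assumption: this lookup is used only when step 1 found no match.
    if doc_type is None and mime_type is not None:
        doc_type = MIME_TO_TYPE.get(mime_type.lower())

    # Step 3: map the normalized document type to a parser type.
    if doc_type is None:
        return None
    return TYPE_TO_PARSER.get(doc_type)


print(resolve_parser("http://example.com/docs/page.content"))
print(resolve_parser("http://example.com/get?id=7", mime_type="application/xhtml+xml"))
print(resolve_parser("http://example.com/blob.bin"))
```

A return value of None in this sketch corresponds to the case where the system cannot identify the document format and rejects the document.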
If the parser type is stellent and the text extractor cannot recognize
the document format, you might see an error message. The error can occur if:
- The document is corrupted.
- The document is not in a format that the text extractor supports.
To solve this problem, add the rejected file formats to
the stellentTypes.cfg file. You also need to
update the mimetypes.xml file or the parser_config.xml file
to specify that the MIME type or extension of the rejected document
formats is to be associated with the text extractor.