Default supported document types

When the system detects the format of a document, it evaluates only certain document types.

The following document formats are detected and parsed automatically by the built-in collection parser services:
  • HTML
  • Plain text
  • XML

By default, the following document formats are parsed by the text extractor:
  • Adobe Portable Document Format (PDF)
  • Lotus® 1-2-3
  • Lotus Freelance Graphics®
  • Lotus Word Pro®
  • JustSystems Ichitaro
  • Microsoft Excel (versions through 2010)
  • Microsoft PowerPoint (versions through 2010)
  • Microsoft Visio
  • Microsoft Word (versions through 2010)
  • Rich Text Format (RTF)
  • StarOffice/OpenOffice Calc
  • StarOffice/OpenOffice Impress
  • StarOffice/OpenOffice Draw
  • StarOffice/OpenOffice Writer

Microsoft Office Open XML file formats and OpenOffice.org OpenDocument formats are parsed without any changes to the configuration files.

To parse other types of documents, you must update configuration files (parser_config.xml, mimetypes.xml, and stellenttypes.cfg) to specify rules for mapping specific document types to a collection parser service or the text extractor.
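
The default routing described above can be pictured with a small sketch. This is only a conceptual illustration, not the product's configuration syntax or API: the function name route_document, the returned labels, and the file extensions used as stand-ins for document formats are all assumptions chosen for the example. The real mapping rules live in parser_config.xml, mimetypes.xml, and stellenttypes.cfg, whose syntax is not shown here.

```python
from pathlib import PurePath

# Hypothetical stand-ins for the formats handled by the built-in
# collection parser services (HTML, plain text, XML).
COLLECTION_PARSER_EXTENSIONS = {".html", ".htm", ".txt", ".xml"}

# Hypothetical, partial stand-ins for formats parsed by the text
# extractor by default (PDF, Word, Excel, PowerPoint, Visio, RTF).
TEXT_EXTRACTOR_EXTENSIONS = {".pdf", ".doc", ".xls", ".ppt", ".vsd", ".rtf"}


def route_document(filename: str) -> str:
    """Return which component would handle the file in this sketch."""
    # Compare on the lowercased extension so "REPORT.PDF" matches ".pdf".
    extension = PurePath(filename).suffix.lower()
    if extension in COLLECTION_PARSER_EXTENSIONS:
        return "collection parser service"
    if extension in TEXT_EXTRACTOR_EXTENSIONS:
        return "text extractor"
    # Anything else has no default mapping in this sketch and would
    # need an explicit rule in the configuration files.
    return "needs configuration"


if __name__ == "__main__":
    for name in ("index.html", "REPORT.PDF", "archive.zip"):
        print(name, "->", route_document(name))
```

In this sketch, a file that matches neither table falls through to "needs configuration", which mirrors the requirement to add a mapping rule before other document types can be parsed.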