Builtin converters

This topic lists the available builtin converters and describes their settings, if any.

All converters have the following general settings.
Converter Name
The name of the converter. Alphanumeric characters, spaces, underscores, and hyphens are allowed.
Converter description
A description of the converter.
Content-types of input data for this converter
The document content types that the converter handles.
Content-types of metadata for conversion
The document content types that the converter handles as metadata.
Content-type of output data for this converter
The output content type of the converted document. The default value means that the content type depends on the input content type.
Conditional URL
A regular expression. The converter runs only when the input document name matches this regular expression. If this value is empty, the converter always runs.
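The Conditional URL rule can be sketched as follows. This is an illustration only, not the product's implementation: the function name should_run is hypothetical, Python's re module stands in for whatever regular expression engine the product uses, and the sketch assumes that the whole document name must match the expression.

```python
import re


def should_run(conditional_url: str, document_name: str) -> bool:
    """Return True when a converter with this Conditional URL would run."""
    # An empty setting means the converter always runs.
    if not conditional_url:
        return True
    # Assumption: the entire document name must match the expression.
    return re.fullmatch(conditional_url, document_name) is not None
```

For example, under these assumptions the expression .*\.pdf selects http://example.com/a.pdf but not http://example.com/a.html.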
Binary Detect Converter
Examines the document with Apache Tika and adds to each document the best guess of a content type. This converter does not modify the input data but adds the content type of that data so that subsequent converters can convert the data to an indexable form. See Apache Tika Supported Document Formats for more information on Apache Tika.

You would normally use this converter with the File Extension To Mime Type Converter and the Guess Content Converter in the following sequence.

  1. File Extension To Mime Type Converter
  2. Binary Detect Converter
  3. Guess Content Converter
Settings
Keep original mimetype
Keeps original content type if already set. The default value is true.
Use file extension
Uses the file extension to guess the content type. The default value is false.
Text detection
Specifies whether to detect text/plain content type. The default value is false.
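The three-step sequence recommended for the Binary Detect Converter can be pictured as a small pipeline in which each step fills in the content type only if an earlier step has not already set it. The sketch below is hypothetical throughout: the Document class, the extension table, and the magic-byte check (a stand-in for Apache Tika detection) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    name: str
    data: bytes
    content_type: Optional[str] = None


# Hypothetical extension table; the real mapping is configured per converter.
EXTENSION_TYPES = {".csv": "text/csv", ".pdf": "application/pdf"}


def by_extension(doc: Document) -> Document:
    """Step 1: guess the type from the file extension."""
    if doc.content_type is None:
        for ext, ctype in EXTENSION_TYPES.items():
            if doc.name.lower().endswith(ext):
                doc.content_type = ctype
    return doc


def binary_detect(doc: Document) -> Document:
    """Step 2: stand-in for Tika detection, using a PDF magic number."""
    if doc.content_type is None and doc.data.startswith(b"%PDF-"):
        doc.content_type = "application/pdf"
    return doc


def guess_content(doc: Document) -> Document:
    """Step 3: last resort, decide between plain text and opaque binary."""
    if doc.content_type is None:
        try:
            doc.data.decode("utf-8")
            doc.content_type = "text/plain"
        except UnicodeDecodeError:
            doc.content_type = "application/octet-stream"
    return doc


PIPELINE = [by_extension, binary_detect, guess_content]


def run_pipeline(doc: Document) -> Document:
    # Each step sees the content type left by the steps before it.
    for step in PIPELINE:
        doc = step(doc)
    return doc
```

Because every step keeps a content type that is already set, the cheapest check (the file extension) wins when it applies, and the content-based steps handle only the documents that remain untyped.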
Binary File Converter
Parses a binary file. The following file formats are supported.
  • Microsoft Excel '97(-2007)
  • Microsoft Excel 2007 OOXML (.xlsx)
  • Microsoft Outlook MSG email
  • Microsoft Outlook PST email
  • Microsoft PowerPoint 2007 OOXML (.pptx)
  • Microsoft PowerPoint '97(-2007)
  • Microsoft Word 2007 OOXML (.docx)
  • Microsoft Word '97(-2007)
  • Portable Document Format (PDF)
  • mbox used by Unix-style mailbox
Settings
Max Character Length
The maximum number of characters in the parsed text. If the document has more characters, the conversion fails. The default value is 0 (no limit).
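The behavior of Max Character Length in the Binary File Converter can be sketched as a simple guard. The function name check_length and the use of ValueError to signal a failed conversion are assumptions for illustration; the product's actual error handling is not described here.

```python
def check_length(text: str, max_character_length: int = 0) -> str:
    """Return the parsed text, or fail when it exceeds the limit."""
    # 0 is the default and means that no limit applies.
    if max_character_length > 0 and len(text) > max_character_length:
        raise ValueError(
            f"parsed text has {len(text)} characters, "
            f"limit is {max_character_length}"
        )
    return text
```

Note that exceeding this limit fails the conversion rather than shortening the text.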
Content Type Converter
Adds the correct content type for metadata extracted by the crawler.
Settings
None.
CSV Converter
Parses CSV files.
Settings
Separators
Specify the separator characters. For example, ,; (a comma and a semicolon).
Quote
Specify a quotation character. For example, ".
Starting line as a header
Read the starting line as a header, which enables the column names to be mapped to index fields.
Merge metadata
Merge metadata into each document. This setting increases storage consumption.
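The effect of the separator, quote, and header settings can be illustrated with Python's standard csv module. This is a rough analogy, not the converter's implementation: the function parse_csv and the generated names column1, column2, and so on are hypothetical, and Python's csv module accepts only a single separator character, whereas the Separators setting can list several.

```python
import csv
import io


def parse_csv(text, separator=",", quote='"', first_line_is_header=True):
    """Parse CSV text into one dict per data row."""
    reader = csv.reader(io.StringIO(text), delimiter=separator, quotechar=quote)
    rows = list(reader)
    if not rows:
        return []
    if first_line_is_header:
        # Column names from the first line become the field names.
        header, data = rows[0], rows[1:]
    else:
        # Hypothetical fallback names when there is no header line.
        header = [f"column{i + 1}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(header, row)) for row in data]
```

With a semicolon separator, the line "A; B";Smith under the header title;author yields the fields title = A; B and author = Smith, because the quotation character protects the separator inside the first value.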
Document Filter
Removes documents from the converter pipeline. This filter can be used to avoid indexing documents that do not have any interesting text or metadata.
Settings
Content types
Keep documents with the specified content types. The default values are text/axml text/plain application/metadata application/document-deletion application/crawlspace-deletion.
URLs
Keep documents with the URLs specified here.
Encoding Converter
Detects the text encoding type and converts it.
Settings
Mark Limit (applies to version 12.0.2.1 and subsequent versions unless specifically overridden)
Optional. Maximum number of bytes that the encoding detector will check to detect the encoding. Default value: 120000.
Filtering Content Type (applies to version 12.0.2.1 and subsequent versions unless specifically overridden)
Optional. Content types for filtering. For these content types, text within angle brackets (< and >) will be removed before detection. Default value: text/xml, application/xml, text/html, application/vnd.wap.xhtml+xml, application/x-asp, application/xhtml+xml
Force Encoding (applies to version 12.0.2.1 and subsequent versions unless specifically overridden)
Optional. If this parameter is set, the parameter is used as the encoding without any detection.
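How the three Encoding Converter settings might interact can be sketched as follows. Everything here is an assumption made for illustration: the function detect_encoding, the two-entry candidate list, and the trial-decoding approach are stand-ins, and the product's detector is certainly more sophisticated.

```python
import codecs
import re
from typing import Optional

# Default content types whose markup is removed before detection.
FILTERED_TYPES = {
    "text/xml", "application/xml", "text/html",
    "application/vnd.wap.xhtml+xml", "application/x-asp",
    "application/xhtml+xml",
}
# Hypothetical candidates, tried in order.
CANDIDATES = ("utf-8", "latin-1")


def _decodes(sample: bytes, encoding: str) -> bool:
    # final=False tolerates a multi-byte character cut off by the limit.
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        decoder.decode(sample, final=False)
        return True
    except UnicodeDecodeError:
        return False


def detect_encoding(data: bytes, content_type: str,
                    mark_limit: int = 120000,
                    force_encoding: Optional[str] = None) -> Optional[str]:
    # Force Encoding: skip detection entirely.
    if force_encoding:
        return force_encoding
    # Mark Limit: examine only the first mark_limit bytes.
    sample = data[:mark_limit]
    # Filtering Content Type: drop text within angle brackets.
    if content_type in FILTERED_TYPES:
        sample = re.sub(rb"<[^>]*>", b"", sample)
    for encoding in CANDIDATES:
        if _decodes(sample, encoding):
            return encoding
    return None
```

The sketch shows why the settings matter: a smaller Mark Limit makes detection cheaper but can miss evidence that appears later in the document, and filtering markup keeps attribute values and tag names from influencing the result.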
Field Filter
Configures filters for parsing field values in different ways, such as combining field values into a single multiple-value field so that the values can be analyzed as a single facet.
Filter list
A list of field filters that you have created. For instructions, see Adding filters to a field filter.
File Extension To Mime Type Converter
Guesses content type by file extension.
Settings
Keep original mimetype
Specify whether or not to keep the original mimetype.
Ignore capital letters in file extensions
Specify whether or not to ignore the capital letters in the file extensions.
File extension to mimetype
Specify the file extension to mimetype mapping. For example, .csv to text/csv.
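The three File Extension To Mime Type Converter settings can be sketched as a lookup function. The name mime_from_extension and its parameters are hypothetical, and the sketch assumes that the extension is everything from the last dot in the document name.

```python
import os
from typing import Dict, Optional


def mime_from_extension(name: str, mapping: Dict[str, str],
                        original: Optional[str] = None,
                        keep_original: bool = True,
                        ignore_case: bool = True) -> Optional[str]:
    """Return a content type chosen from the file extension."""
    # Keep original mimetype: an existing type is left untouched.
    if keep_original and original:
        return original
    ext = os.path.splitext(name)[1]
    if ignore_case:
        # Ignore capital letters: compare extensions in lowercase.
        ext = ext.lower()
        mapping = {key.lower(): value for key, value in mapping.items()}
    # Fall back to the original type when the extension is unmapped.
    return mapping.get(ext, original)
```

With the mapping .csv to text/csv, the name DATA.CSV resolves to text/csv only when capital letters are ignored.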
Guess Content Converter
Guesses content type by file content.
Settings
Max bytes examined
Maximum number of bytes examined by this converter.
Keep original mimetype
Specify whether or not to keep the original mimetype.
Type override
Specify type overrides. For example, orgType1 to newType1.
GZIP (gunzip) Converter
Extracts .gz files.
Settings
None.
HTML Converter
Parses .html files.
Settings
Meta names to output
The content of the meta tags that match this set of wildcard expressions will be output.
Body XPath
When extracting text from a page, this XPath expression can be used to extract one or more starting points on the page. If there are no matches for the XPath expression, the top of the page will be used.
Tags to strip
Any tag that matches this XPath expression will be discarded along with any subnodes it contains. This expression is the match on an xsl:select node. If a tag matches both the tags to strip and the tags to keep, the output is undefined.
Tags to keep
Any tag that matches this XPath expression will be included in the indexed text (without any attributes). Any tag that does not match either this XPath or the Tags to strip XPath will be removed but its children will be processed. This expression is the match on an xsl:select node. If a tag matches both the tags to strip and the tags to keep, the output is undefined.
Include ALT text
This flag enables the inclusion of ALT text on images and anchor tags.
Title XPath
When extracting the title of a page, this XPath expression can be used to specify a location. If there are no matches for this XPath expression, the default HTML title tag will be used.
Disable title extraction
If this is selected, titles will not be extracted from the HTML to create extra "title" content. This is useful when the title is provided through external metadata.
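The Body XPath and Tags to strip settings of the HTML Converter can be illustrated with a small text extractor. This is a simplified sketch under several assumptions: the function extract_text is hypothetical, the input must be well-formed XHTML, and Python's xml.etree.ElementTree supports only a subset of XPath, so the expressions the product accepts may differ.

```python
import xml.etree.ElementTree as ET


def extract_text(xhtml, body_xpath=None, strip_xpath=None):
    """Collect text from the Body XPath matches, skipping stripped tags."""
    root = ET.fromstring(xhtml)
    # Body XPath: one or more starting points; fall back to the page top.
    starts = root.findall(body_xpath) if body_xpath else []
    if not starts:
        starts = [root]
    # Tags to strip: matched elements are dropped with their subnodes.
    stripped = {id(e) for e in root.iterfind(strip_xpath)} if strip_xpath else set()

    def walk(elem):
        if id(elem) in stripped:
            return []
        parts = [elem.text or ""]
        for child in elem:
            parts.extend(walk(child))
            # Text after a child belongs to the parent, so it is kept.
            parts.append(child.tail or "")
        return parts

    text = "".join(part for start in starts for part in walk(start))
    return " ".join(text.split())
```

For a page whose main content is in a div with the id main, a Body XPath of .//div[@id='main'] restricts extraction to that element, and a Tags to strip value of .//script removes script content from the result.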
HTML (XSLT) Converter
Converts HTML to AXML, an internal data format for documents, by using XSLT. The basic functionality is the same as that of the HTML Converter.
Settings
Meta names to output
The content of the meta tags that match this set of wildcard expressions will be output.
Body XPath
When extracting text from a page, this XPath expression can be used to extract one or more starting points on the page. If there are no matches for the XPath expression, the top of the page will be used.
Tags to strip
Any tag that matches this XPath expression will be discarded along with any subnodes it contains. This expression is the match on an xsl:select node. If a tag matches both the tags to strip and the tags to keep, the output is undefined.
Tags to keep
Any tag that matches this XPath expression will be included in the indexed text (without any attributes). Any tag that does not match either this XPath or the Tags to strip XPath will be removed but its children will be processed. This expression is the match on an xsl:select node. If a tag matches both the tags to strip and the tags to keep, the output is undefined.
Include ALT text
This flag enables the inclusion of ALT text on images and anchor tags.
Title XPath
When extracting the title of a page, this XPath expression can be used to specify a location. If there are no matches for this XPath expression, the default HTML title tag will be used.
Disable title extraction
If this is selected, titles will not be extracted from the HTML to create extra "title" content. This is useful when the title is provided through external metadata.
JSON Converter
Parses JSON files.
Settings
Merge metadata
Merge metadata into each document. This setting increases storage consumption.
Tar Converter
Extracts .tar files.
Settings
Remove metadata
Remove the metadata from the original compressed file.
PDF Converter
Parses the contents of .pdf files and converts them to meta information and text strings in the AXML document.
Settings
Max Memory Bytes
The maximum number of bytes of memory used for buffering. To use only memory, set this value to -1. To use only a temporary file, set it to 0.
Advanced Settings
A key-value list of parameters for configuring advanced settings.
Advanced Settings
max_storage_bytes
The supported value for max_storage_bytes is a long integer that specifies the maximum number of memory bytes used for buffering. If you want to use only memory, set it to -1. If you want to use only a temporary file, set it to 0. An empty or negative value means no limit. The default is no limit. If Max Memory Bytes is set to -1, this configuration is ignored.
vertical_style

The supported values for vertical_style are the strings TRUE and AUTO. Any other value is treated as Otherwise.

TRUE: Parse as vertical style PDF

AUTO: Detect and parse automatically as vertical style PDF

Otherwise: Parse as normal PDF
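The choice that the Max Memory Bytes setting of the PDF Converter makes between memory and a temporary file resembles Python's spooled temporary files. The sketch below is an analogy only: the function open_buffer is hypothetical and says nothing about how the converter is implemented.

```python
import io
import tempfile


def open_buffer(max_memory_bytes: int):
    """Return a binary buffer chosen according to Max Memory Bytes."""
    if max_memory_bytes == -1:
        # -1: buffer in memory only.
        return io.BytesIO()
    if max_memory_bytes == 0:
        # 0: buffer in a temporary file only.
        return tempfile.TemporaryFile()
    # Otherwise: start in memory, spill to disk past the limit.
    return tempfile.SpooledTemporaryFile(max_size=max_memory_bytes)
```

All three buffers support the same write, seek, and read calls, so the calling code does not need to know which one it received.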

Truncation Converter
Truncates text fields.
Settings
Max Character Length
Specifies the truncation limit.
ZIP Converter
Extracts .zip files.
Settings
Remove metadata
Remove the metadata from the original compressed file.