Builtin converters

This topic lists the available builtin converters and describes their settings, if any.

All converters have the following general settings.

Converter Name: The name of the converter. Alphanumeric characters, spaces, underscores, and hyphens are allowed.
Converter description: A description of the converter.
Content-types of input data for this converter: The document content types that the converter handles.
Content-types of metadata for conversion: The document content types that the converter handles as metadata.
Content-type of output data for this converter: The output content type of the converted document. The default value means that the content type depends on the input content type.
Conditional URL: A regular expression. The converter runs only when the input document name matches this regular expression. If this value is empty, the converter always runs.
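
The Conditional URL rule can be illustrated with a small Python sketch. This is not the product's implementation: the function name should_run is hypothetical, and it is an assumption that the pattern may match anywhere in the document name rather than having to match the whole name.

```python
import re
from typing import Optional


def should_run(conditional_url: Optional[str], document_name: str) -> bool:
    """Decide whether a converter with this Conditional URL setting would run."""
    # An empty (or missing) pattern means the converter always runs.
    if not conditional_url:
        return True
    # Assumption: a match anywhere in the name is enough (re.search).
    return re.search(conditional_url, document_name) is not None


print(should_run("", "http://example.com/a.pdf"))
print(should_run(r"\.pdf$", "http://example.com/a.pdf"))
print(should_run(r"\.pdf$", "http://example.com/a.html"))
```

If the product anchors the expression to the full document name, re.fullmatch would be the closer equivalent.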

Binary Detect Converter

Examines the document with Apache Tika and adds its best guess of the content type to each document. This converter does not modify the input data but adds the content type of that data so that subsequent converters can convert the data to an indexable form. See Apache Tika Supported Document Formats for more information on Apache Tika.

You would normally use this converter with the File Extension To Mime Type Converter and the Guess Content Converter in the following sequence.

File Extension To Mime Type Converter
Binary Detect Converter
Guess Content Converter

Settings

Keep original mimetype: Keeps the original content type if it is already set. The default value is true.
Use file extension: Uses the file extension to guess the content type. The default value is false.
Text detection: Specifies whether to detect the text/plain content type. The default value is false.

Binary File Converter

Parses a binary file. The following file formats are supported.

Microsoft Excel '97(-2007)
Microsoft Excel 2007 OOXML (.xlsx)
Microsoft Outlook MSG email
Microsoft Outlook PST email
Microsoft PowerPoint 2007 OOXML (.pptx)
Microsoft PowerPoint '97(-2007)
Microsoft Word 2007 OOXML (.docx)
Microsoft Word '97(-2007)
Portable Document Format (PDF)
mbox used by Unix-style mailbox

Settings

Max Character Length: The maximum number of characters in the parsed text. If the document has more characters, the conversion fails. The default value is 0 (no limit).
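
The limit semantics described above (0 means no limit, and exceeding the limit is a failure rather than a truncation) can be sketched in Python as follows. The names ConversionError and check_length are hypothetical and exist only for illustration.

```python
class ConversionError(Exception):
    """Hypothetical error raised when a document exceeds the limit."""


def check_length(parsed_text: str, max_character_length: int = 0) -> str:
    """Return the text unchanged, or fail if it is longer than the limit."""
    # A limit of 0 disables the check entirely.
    if max_character_length > 0 and len(parsed_text) > max_character_length:
        raise ConversionError(
            f"parsed text has {len(parsed_text)} characters, "
            f"limit is {max_character_length}"
        )
    return parsed_text


print(check_length("hello", 0))
print(check_length("hello", 5))
```

Note the contrast with the Truncation Converter, which shortens text instead of failing.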

Content Type Converter

Adds the correct content type for metadata extracted by the crawler.

Settings: None.

CSV Converter

Parses CSV files.

Settings

Separators: Specify the separator characters. For example, ,; (a comma and a semicolon).
Quote: Specify a quotation character. For example, ".
Starting line as a header: Read the starting line as a header, which enables the column names to be mapped to index fields.
Merge metadata: Merge metadata into each document. This setting increases storage consumption.
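
To make the interaction of the Separators, Quote, and Starting line as a header settings concrete, here is a minimal Python sketch. It is not the converter's actual parser. The function names, the handling of a doubled quote as an escaped quote, and the column0, column1 naming used when there is no header are all assumptions.

```python
from typing import Dict, List


def split_line(line: str, separators: str = ",;", quote: str = '"') -> List[str]:
    """Split one line on any separator character that is outside quotes."""
    fields: List[str] = []
    current: List[str] = []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == quote:
            # Assumption: a doubled quote inside a quoted field is a literal quote.
            if in_quotes and i + 1 < len(line) and line[i + 1] == quote:
                current.append(quote)
                i += 1
            else:
                in_quotes = not in_quotes
        elif ch in separators and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields


def parse_csv(text: str, separators: str = ",;", quote: str = '"',
              header: bool = True) -> List[Dict[str, str]]:
    """Turn CSV text into one dict per data row, keyed by column name."""
    rows = [split_line(ln, separators, quote) for ln in text.splitlines() if ln]
    if not rows:
        return []
    if header:
        names, data = rows[0], rows[1:]
    else:
        # Hypothetical fallback names when there is no header line.
        names, data = [f"column{i}" for i in range(len(rows[0]))], rows
    return [dict(zip(names, row)) for row in data]


print(parse_csv('id;title\n1;"a;b"\n2,"say ""hi"""'))
```

With the header enabled, the keys of each dict are the names that would be mapped to index fields.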

Document Filter

Removes documents from the converter pipeline. This filter can be used to avoid indexing documents that do not have any interesting text or metadata.

Settings

Content types: Keep documents with the specified content types. The default values are text/axml, text/plain, application/metadata, application/document-deletion, and application/crawlspace-deletion.
URLs: Keep documents with the URLs specified here.
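
The filtering behavior can be sketched in Python as below. This is an illustration only. It assumes that a document is kept when it satisfies either setting, and that the URLs setting is compared by exact string match; the source does not state either point. The dictionary keys url and content_type are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence

# Default content types as listed for the Content types setting.
DEFAULT_CONTENT_TYPES = (
    "text/axml",
    "text/plain",
    "application/metadata",
    "application/document-deletion",
    "application/crawlspace-deletion",
)


def filter_documents(docs: Iterable[Dict[str, str]],
                     content_types: Sequence[str] = DEFAULT_CONTENT_TYPES,
                     urls: Sequence[str] = ()) -> List[Dict[str, str]]:
    """Keep documents whose content type or URL is listed; drop the rest."""
    kept = []
    for doc in docs:
        # Assumption: either condition is sufficient to keep the document.
        if doc["content_type"] in content_types or doc["url"] in urls:
            kept.append(doc)
    return kept


docs = [
    {"url": "http://example.com/a.txt", "content_type": "text/plain"},
    {"url": "http://example.com/b.png", "content_type": "image/png"},
    {"url": "http://example.com/c.png", "content_type": "image/png"},
]
print(filter_documents(docs, urls=("http://example.com/c.png",)))
```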

Encoding Converter

Detects the text encoding type and converts it.

Settings

Mark Limit

Optional. Maximum number of bytes that the encoding detector will check to detect the encoding.

Default value: 120000.

Filtering Content Type

Optional. Content types for filtering. For these content types, text within angle brackets (< and >) will be removed before detection.

Default value:

text/xml, application/xml, text/html, application/vnd.wap.xhtml+xml, application/x-asp, application/xhtml+xml

Force Encoding

Optional. If this parameter is set, its value is used as the encoding and no detection is performed.
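
The following Python sketch shows how the three settings could interact. It is not the converter's detection algorithm: real encoding detectors use statistical analysis, whereas this stand-in simply tries a short, assumed list of candidate encodings in order. The function name decode_text, the candidates parameter, and the latin-1 fallback are hypothetical.

```python
import codecs
import re
from typing import Optional, Sequence

# Default value of the Filtering Content Type setting.
FILTERED_TYPES = {
    "text/xml", "application/xml", "text/html",
    "application/vnd.wap.xhtml+xml", "application/x-asp",
    "application/xhtml+xml",
}


def decode_text(data: bytes, content_type: str, mark_limit: int = 120000,
                force_encoding: Optional[str] = None,
                candidates: Sequence[str] = ("utf-8", "shift_jis")) -> str:
    """Decode bytes to text using a forced or a guessed encoding."""
    # Force Encoding: skip detection completely.
    if force_encoding:
        return data.decode(force_encoding)
    # Mark Limit: only the first mark_limit bytes are examined.
    sample = data[:mark_limit]
    # Filtering Content Type: drop text within angle brackets before detection.
    if content_type in FILTERED_TYPES:
        sample = re.sub(rb"<[^>]*>", b"", sample)
    for encoding in candidates:
        try:
            # final=False tolerates a multi-byte character cut off by the limit.
            codecs.getincrementaldecoder(encoding)().decode(sample, final=False)
        except UnicodeDecodeError:
            continue
        return data.decode(encoding, errors="replace")
    # Hypothetical fallback: latin-1 can decode any byte sequence.
    return data.decode("latin-1")


print(decode_text(b"<p>caf\xc3\xa9</p>", "text/html"))
```

Only the detection sample is filtered and limited; the full input is decoded once an encoding has been chosen.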

Field Filter

Configures filters for parsing field values in different ways, such as combining field values into a single multiple-value field so that the values can be analyzed as a single facet.

Filter list: A list of field filters that you have created. For instructions, see Adding filters to a field filter.

File Extension To Mime Type Converter

Guesses content type by file extension.

Settings

Keep original mimetype: Specify whether or not to keep the original mimetype.
Ignore capital letters in file extensions: Specify whether or not to ignore the capital letters in the file extensions.
File extension to mimetype: Specify the file extension to mimetype mapping. For example, .csv to text/csv.
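
A minimal Python sketch of how these three settings might combine is shown below. The function name guess_by_extension and the default argument values are assumptions, not documented behavior, and the sketch treats the document name as a plain file name (query strings in URLs are not handled).

```python
import os
from typing import Dict, Optional


def guess_by_extension(name: str, current_type: Optional[str],
                       mapping: Dict[str, str],
                       keep_original: bool = True,
                       ignore_case: bool = True) -> Optional[str]:
    """Return a content type for the document based on its file extension."""
    # Keep original mimetype: an existing content type wins.
    if keep_original and current_type:
        return current_type
    extension = os.path.splitext(name)[1]
    lookup = mapping
    if ignore_case:
        # Compare extensions without regard to capital letters.
        extension = extension.lower()
        lookup = {key.lower(): value for key, value in mapping.items()}
    # Fall back to whatever type the document already had.
    return lookup.get(extension, current_type)


mapping = {".csv": "text/csv"}
print(guess_by_extension("REPORT.CSV", None, mapping))
print(guess_by_extension("report.csv", "text/plain", mapping))
```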

Guess Content Converter

Guesses content type by file content.

Settings

Max bytes examined: Maximum number of bytes examined by this converter.
Keep original mimetype: Specify whether or not to keep the original mimetype.
Type override: Specify type overrides. For example, orgType1 to newType1.

GZIP (gunzip) Converter

Extracts .gz files.

Settings: None.

HTML Converter

Parses .html files.

Settings

Meta names to output: The content of the meta tags that match this set of wildcard expressions will be output.
Body XPath: When extracting text from a page, this XPath expression can be used to extract one or more starting points on the page. If there are no matches for the XPath expression, the top of the page will be used.
Tags to strip: Any tag that matches this XPath expression will be discarded along with any subnodes it contains. This expression is the match on an xsl:select node. If a tag matches both the tags to strip and the tags to keep, the output is undefined.
Tags to keep: Any tag that matches this XPath expression will be included in the indexed text (without any attributes). Any tag that does not match either this XPath or the Tags to strip XPath will be removed but its children will be processed. This expression is the match on an xsl:select node. If a tag matches both the tags to strip and the tags to keep, the output is undefined.
Include ALT text: This flag enables the inclusion of ALT text on images and anchor tags.
Title XPath: When extracting the title of a page, this XPath expression can be used to specify a location. If there are no matches for this XPath expression, the default HTML title tag will be used.
Disable title extraction: If this is selected, titles will not be extracted from the HTML to create extra "title" content. This is useful when the title is provided through external metadata.

HTML (XSLT) Converter

Uses XSLT to convert HTML to AXML, an internal data format for documents. The basic functionality is the same as that of the HTML Converter.

Settings

Meta names to output: The content of the meta tags that match this set of wildcard expressions will be output.
Body XPath: When extracting text from a page, this XPath expression can be used to extract one or more starting points on the page. If there are no matches for the XPath expression, the top of the page will be used.
Tags to strip: Any tag that matches this XPath expression will be discarded along with any subnodes it contains. This expression is the match on an xsl:select node. If a tag matches both the tags to strip and the tags to keep, the output is undefined.
Tags to keep: Any tag that matches this XPath expression will be included in the indexed text (without any attributes). Any tag that does not match either this XPath or the Tags to strip XPath will be removed but its children will be processed. This expression is the match on an xsl:select node. If a tag matches both the tags to strip and the tags to keep, the output is undefined.
Include ALT text: This flag enables the inclusion of ALT text on images and anchor tags.
Title XPath: When extracting the title of a page, this XPath expression can be used to specify a location. If there are no matches for this XPath expression, the default HTML title tag will be used.
Disable title extraction: If this is selected, titles will not be extracted from the HTML to create extra "title" content. This is useful when the title is provided through external metadata.

JSON Converter

Parses JSON files.

Settings

Merge metadata: Merge metadata into each document. This setting increases storage consumption.

Tar Converter

Extracts .tar files.

Settings

Remove metadata: Remove the metadata from the original compressed file.

PDF Converter

Parses the contents of .pdf files and converts them to metadata and text strings in the AXML document.

Settings

Max Memory Bytes: The maximum number of bytes of memory used for buffering. To use only memory, set this value to -1. To use only a temporary file, set it to 0.
Advanced Settings: A key-value list of parameters in which you can configure advanced settings.

Advanced Settings

max_storage_bytes

The supported value for max_storage_bytes is a long integer. This value is the maximum number of memory bytes used for buffering. To use only memory, set it to -1. To use only a temporary file, set it to 0. An empty or negative value means no limit. The default is no limit. If Max Memory Bytes is set to -1, this setting is ignored.

vertical_style

The supported values for vertical_style are the strings TRUE and AUTO. Any other value is handled as described under Otherwise.

TRUE: Parse as vertical style PDF

AUTO: Detect and parse automatically as vertical style PDF

Otherwise: Parse as normal PDF

Truncation Converter

Truncates text fields.

Settings

Max Character Length: Specifies the truncation limit.

ZIP Converter

Extracts .zip files.

Settings

Remove metadata: Remove the metadata from the original compressed file.