Builtin converters
This topic lists the available builtin converters and describes their settings, if any.
All converters have the following general settings.
- Converter Name
- The name of the converter. Alphanumeric characters, spaces, underscores, and hyphens are allowed.
- Converter description
- A description of the converter.
- Content-types of input data for this converter
- The document content types that the converter handles.
- Content-types of metadata for conversion
- The document content types that the converter handles as metadata.
- Content-type of output data for this converter
- The output content type of the converted document. The default value means that the content type depends on the input content type.
- Conditional URL
- A regular expression. The converter runs only when the input document name matches this regular expression. If this value is empty, the converter always runs.
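The Conditional URL rule can be pictured with a short Python sketch. The function name `should_run` is hypothetical, and the sketch assumes that the expression may match anywhere in the document name; whether the product requires a full or a partial match is not stated here.

```python
import re


def should_run(conditional_url: str, document_name: str) -> bool:
    """Decide whether a converter processes the given document."""
    # An empty Conditional URL means that the converter always runs.
    if not conditional_url:
        return True
    # Assumption: a partial match is enough (re.search, not re.fullmatch).
    return re.search(conditional_url, document_name) is not None
```

For example, the pattern `\.pdf$` limits a converter to documents whose names end in .pdf.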
- Binary Detect Converter
- Examines the document with Apache Tika and adds its best guess of the content type to each document. This converter does not modify the input data but adds the content type of that data so that subsequent converters can convert the data to an indexable form. For more information on Apache Tika, see Apache Tika Supported Document Formats.
Normally, you use this converter with the File Extension To Mime Type Converter and the Guess Content Converter in the following sequence.
- File Extension To Mime Type Converter
- Binary Detect Converter
- Guess Content Converter
- Settings
- Keep original mimetype
- Keeps the original content type if it is already set. The default value is true.
- Use file extension
- Uses the file extension to guess the content type. The default value is false.
- Text detection
- Specifies whether to detect the text/plain content type. The default value is false.
- Binary File Converter
- Parses a binary file. The following file formats are supported.
- Microsoft Excel '97(-2007)
- Microsoft Excel 2007 OOXML (.xlsx)
- Microsoft Outlook MSG email
- Microsoft Outlook PST email
- Microsoft PowerPoint 2007 OOXML (.pptx)
- Microsoft PowerPoint '97(-2007)
- Microsoft Word 2007 OOXML (.docx)
- Microsoft Word '97(-2007)
- Portable Document Format (PDF)
- mbox (used by Unix-style mailboxes)
- Settings
- Max Character Length
- The maximum number of characters in the parsed text. If the document has more characters, the conversion fails. The default value is 0 (no limit).
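The Max Character Length rule can be sketched as follows. The function name `enforce_max_length` and the use of `ValueError` are assumptions made for illustration; how the product reports a failed conversion is not described here.

```python
def enforce_max_length(text: str, max_chars: int = 0) -> str:
    """Return the parsed text, or fail when it exceeds the limit."""
    # The default of 0 means that no limit applies.
    if max_chars > 0 and len(text) > max_chars:
        raise ValueError(
            f"parsed text has {len(text)} characters, limit is {max_chars}"
        )
    return text
```

Note that the document is rejected rather than shortened; shortening text is the job of the Truncation Converter.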
- Content Type Converter
- Adds the correct content type to metadata extracted by the crawler.
- Settings
- None.
- CSV Converter
- Parses CSV files.
- Settings
- Separators
- Specify the separator characters. For example: ,;
- Quote
- Specify a quotation character. For example: "
- Starting line as a header
- Read the starting line as a header, which enables the column names to be mapped to index fields.
- Merge metadata
- Merge metadata into each document. This setting increases storage consumption.
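The following Python sketch shows how the Separators, Quote, and Starting line as a header settings might interact. It is not the converter's implementation: the function names are hypothetical, every character in the separators string is assumed to act as a separator, and doubled quotation characters inside a quoted field are not handled.

```python
def split_line(line: str, separators: str = ",;", quote: str = '"') -> list:
    """Split one line on any separator character that is outside quotes."""
    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == quote:
            in_quotes = not in_quotes      # quotes delimit, they are not kept
        elif ch in separators and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def parse_csv(text: str, separators: str = ",;", quote: str = '"',
              header: bool = True) -> list:
    """Parse CSV text into rows, or into dicts keyed by the header line."""
    rows = [split_line(l, separators, quote) for l in text.splitlines() if l]
    if not header or not rows:
        return rows
    names, *records = rows
    # With a header, each column name can be mapped to an index field.
    return [dict(zip(names, record)) for record in records]
```

With the header enabled, the input `id;title` followed by `1,"Hello, world"` yields one record whose `title` value keeps the comma inside the quoted field.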
- Document Filter
- Removes documents from the converter pipeline. This filter can be used to avoid indexing
documents that do not have any interesting text or metadata.
- Settings
- Content types
- Keep documents with the specified content types. The default values are text/axml, text/plain, application/metadata, application/document-deletion, and application/crawlspace-deletion.
- URLs
- Keep documents with the URLs specified here.
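A minimal Python sketch of the filter follows. It assumes that a document is kept when either setting matches and that URLs are compared exactly; neither detail is stated in this topic. The dictionary keys `content_type` and `url` are hypothetical.

```python
DEFAULT_CONTENT_TYPES = frozenset({
    "text/axml",
    "text/plain",
    "application/metadata",
    "application/document-deletion",
    "application/crawlspace-deletion",
})


def keep_document(doc: dict, content_types=DEFAULT_CONTENT_TYPES,
                  urls=frozenset()) -> bool:
    """Keep a document if its content type or its URL is listed."""
    # Assumption: the two settings are combined with OR.
    return doc.get("content_type") in content_types or doc.get("url") in urls


def filter_documents(docs, **settings) -> list:
    """Remove every document that the filter does not keep."""
    return [doc for doc in docs if keep_document(doc, **settings)]
```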
- Encoding Converter
- Detects the text encoding type and converts it.
- Settings
- Mark Limit
- Optional. The maximum number of bytes that the encoding detector checks to detect the encoding. The default value is 120000.
- Filtering Content Type
- Optional. Content types for filtering. For these content types, text within angle brackets (< and >) is removed before detection. The default value is text/xml, application/xml, text/html, application/vnd.wap.xhtml+xml, application/x-asp, application/xhtml+xml.
- Force Encoding
- Optional. If this parameter is set, its value is used as the encoding without any detection.
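The three settings can be pictured with the following Python sketch. The detection step is a deliberately naive stand-in (it tries a short list of candidate encodings in order) and is not the detector that the converter uses; the function name `decode_document` and the `candidates` parameter are hypothetical.

```python
import codecs
import re

TAG_FILTERED_TYPES = {
    "text/xml", "application/xml", "text/html",
    "application/vnd.wap.xhtml+xml", "application/x-asp",
    "application/xhtml+xml",
}


def decode_document(data: bytes, content_type: str, mark_limit: int = 120000,
                    force_encoding: str = None,
                    candidates=("utf-8", "cp1252")) -> str:
    """Decode bytes to text, detecting the encoding unless one is forced."""
    # Force Encoding: skip detection entirely.
    if force_encoding:
        return data.decode(force_encoding)
    # Mark Limit: only the first mark_limit bytes are inspected.
    sample = data[:mark_limit]
    truncated = len(data) > mark_limit
    # Filtering Content Type: drop markup so tags do not skew detection.
    if content_type in TAG_FILTERED_TYPES:
        sample = re.sub(rb"<[^>]*>", b"", sample)
    for encoding in candidates:
        decoder = codecs.getincrementaldecoder(encoding)()
        try:
            # A truncated sample may end in the middle of a character,
            # so an incomplete trailing sequence is tolerated in that case.
            decoder.decode(sample, final=not truncated)
        except UnicodeDecodeError:
            continue
        return data.decode(encoding, errors="replace")
    # latin-1 maps every byte, so this fallback never fails.
    return data.decode("latin-1")
```

Only the detection sample is stripped of markup; the full document, tags included, is what gets decoded.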
- Field Filter
- Configures filters for parsing field values in different ways, such as combining field values
into a single multiple-value field so that the values can be analyzed as a single facet.
- Filter list
- A list of field filters that you have created. For instructions, see Adding filters to a field filter.
- File Extension To Mime Type Converter
- Guesses content type by file extension.
- Settings
- Keep original mimetype
- Specify whether to keep the original mimetype.
- Ignore capital letters in file extensions
- Specify whether to ignore capital letters in file extensions.
- File extension to mimetype
- Specify the mapping from file extension to mimetype. For example, .csv to text/csv.
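As an illustration, the three settings might combine as in the following Python sketch. The function name and parameter names are hypothetical, and the sketch assumes that Keep original mimetype takes effect only when a content type is already set.

```python
import os


def mime_from_extension(name: str, mapping: dict, current_type: str = None,
                        keep_original: bool = True,
                        ignore_case: bool = True) -> str:
    """Guess a content type from the file extension of a document name."""
    # Keep original mimetype: do not overwrite a type that is already set.
    if keep_original and current_type:
        return current_type
    extension = os.path.splitext(name)[1]
    if ignore_case:
        # Ignore capital letters: compare extensions in lowercase.
        extension = extension.lower()
        mapping = {ext.lower(): mime for ext, mime in mapping.items()}
    # Without a mapping entry, the content type is left unchanged.
    return mapping.get(extension, current_type)
```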
- Guess Content Converter
- Guesses content type by file content.
- Settings
- Max bytes examined
- The maximum number of bytes that this converter examines.
- Keep original mimetype
- Specify whether to keep the original mimetype.
- Type override
- Specify type overrides. For example, orgType1 to newType1.
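The following Python sketch illustrates the settings with a toy guesser. The content checks are simple prefix tests chosen for illustration and do not reflect the converter's actual detection rules; the sketch also assumes that a type override replaces the guessed type with the mapped type.

```python
def guess_content_type(data: bytes, max_bytes: int = 1024,
                       current_type: str = None, keep_original: bool = True,
                       overrides: dict = None) -> str:
    """Guess a content type from the leading bytes of a document."""
    # Keep original mimetype: do not overwrite a type that is already set.
    if keep_original and current_type:
        return current_type
    # Max bytes examined: look only at the start of the content.
    head = data[:max_bytes].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        guessed = "text/html"
    elif head.startswith(b"<?xml"):
        guessed = "application/xml"
    else:
        guessed = current_type      # no guess, leave the type as it was
    # Type override: map a guessed type to a replacement type.
    return (overrides or {}).get(guessed, guessed)
```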
- GZIP (gunzip) Converter
- Extracts .gz files.
- Settings
- None.
- HTML Converter
- Parses .html files.
- Settings
- Meta names to output
- The content of the meta tags that match this set of wildcard expressions will be output.
- Body XPath
- When extracting text from a page, this XPath expression can be used to extract one or more starting points on the page. If there are no matches for the XPath expression, the top of the page will be used.
- Tags to strip
- Any tag that matches this XPath expression is discarded along with any subnodes that it contains. This expression is the match on an xsl:select node. If a tag matches both the tags to strip and the tags to keep, the output is undefined.
- Tags to keep
- Any tag that matches this XPath expression is included in the indexed text (without any attributes). Any tag that matches neither this XPath expression nor the Tags to strip XPath expression is removed, but its children are processed. This expression is the match on an xsl:select node. If a tag matches both the tags to strip and the tags to keep, the output is undefined.
- Include ALT text
- This flag enables the inclusion of ALT text on images and anchor tags.
- Title XPath
- When extracting the title of a page, this XPath expression can be used to specify a location. If there are no matches for this XPath expression, the default HTML title tag will be used.
- Disable title extraction
- If this is selected, titles are not extracted from the HTML to create extra "title" content. This is useful when the title is provided through external metadata.
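The interplay of Body XPath, Tags to strip, and Tags to keep can be sketched in Python as follows. This is an illustration under several assumptions: the input must be well-formed XHTML, only the XPath subset of the standard xml.etree.ElementTree module is available, and the function name `extract` is hypothetical. A tag that matches both expressions is stripped here, whereas the converter leaves that case undefined.

```python
import xml.etree.ElementTree as ET


def extract(xhtml: str, strip_xpath: str, keep_xpath: str,
            body_xpath: str = None) -> str:
    """Render indexed text using strip, keep, and body XPath settings."""
    root = ET.fromstring(xhtml)
    strip = set(root.findall(strip_xpath))
    keep = set(root.findall(keep_xpath))

    def render(node) -> str:
        # Tags to strip: discard the tag together with its subnodes.
        if node in strip:
            return ""
        inner = node.text or ""
        for child in node:
            # Text after a child tag belongs to the parent, so it survives.
            inner += render(child) + (child.tail or "")
        # Tags to keep: emit the tag without its attributes.
        if node in keep:
            return f"<{node.tag}>{inner}</{node.tag}>"
        # Neither setting matches: drop the tag, keep its children.
        return inner

    # Body XPath: start at the matches, or at the top of the page if none.
    starts = root.findall(body_xpath) if body_xpath else []
    return "".join(render(node) for node in (starts or [root]))
```

For a page that contains a script element, a paragraph with a class attribute, and a div, stripping `.//script` and keeping `.//p` removes the script, keeps the paragraph tag without its attribute, and reduces the div to its text.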
- HTML (XSLT) Converter
- Uses XSLT to convert HTML to AXML, an internal data format for documents. The basic functionality is the same as that of the HTML Converter.
- Settings
- Meta names to output
- The content of the meta tags that match this set of wildcard expressions will be output.
- Body XPath
- When extracting text from a page, this XPath expression can be used to extract one or more starting points on the page. If there are no matches for the XPath expression, the top of the page will be used.
- Tags to strip
- Any tag that matches this XPath expression is discarded along with any subnodes that it contains. This expression is the match on an xsl:select node. If a tag matches both the tags to strip and the tags to keep, the output is undefined.
- Tags to keep
- Any tag that matches this XPath expression is included in the indexed text (without any attributes). Any tag that matches neither this XPath expression nor the Tags to strip XPath expression is removed, but its children are processed. This expression is the match on an xsl:select node. If a tag matches both the tags to strip and the tags to keep, the output is undefined.
- Include ALT text
- This flag enables the inclusion of ALT text on images and anchor tags.
- Title XPath
- When extracting the title of a page, this XPath expression can be used to specify a location. If there are no matches for this XPath expression, the default HTML title tag will be used.
- Disable title extraction
- If this is selected, titles are not extracted from the HTML to create extra "title" content. This is useful when the title is provided through external metadata.
- JSON Converter
- Parses JSON files.
- Settings
- Merge metadata
- Merge metadata into each document. This setting increases storage consumption.
- Tar Converter
- Extracts .tar files.
- Settings
- Remove metadata
- Remove the metadata from the original compressed file.
- PDF Converter
- Parses the contents of .pdf files and converts them to meta information and text strings in the AXML document.
- Settings
- Max Memory Bytes
- The maximum number of bytes of memory used for buffering. To use only memory, set this value to -1. To use only a temporary file, set this value to 0.
- Advanced Settings
- A key-value list of parameters for configuring advanced settings.
- Advanced Settings
- max_storage_bytes
- The supported value for max_storage_bytes is a long integer. This value is the maximum number of memory bytes used for buffering. To use only memory, set this value to -1. To use only a temporary file, set this value to 0. An empty or negative value means no limit. The default is no limit. If Max Memory Bytes is set to -1, this setting is ignored.
- vertical_style
- The supported values for vertical_style are the following strings.
- TRUE: Parse as a vertical-style PDF.
- AUTO: Detect vertical-style PDFs automatically and parse them accordingly.
- Any other value: Parse as a normal PDF.
- Truncation Converter
- Truncates text fields.
- Settings
- Max Character Length
- Specifies the truncation limit.
- ZIP Converter
- Extracts .zip files.
- Settings
- Remove metadata
- Remove the metadata from the original compressed file.