Parsers, document processors, and annotators

The parsers, document processors, and annotators analyze documents that were collected by a crawler and prepare them for indexing.

The results of the analysis are stored in a data store for access by the indexing component. Document processing includes the following tasks:

Detect the character set encoding of each document. Before doing any linguistic analysis, the parser uses this information to convert all text to Unicode.
Detect the source language of each document.
Extract text from each document, whatever its format. For example, the parser extracts text from the tags in XML and HTML documents. The text extractor handles binary formats such as Microsoft Word and Adobe Portable Document Format (PDF) documents.
Extract text and add tokens to enhance data retrieval and mining. During this phase, the parser does the following tasks:
- Character normalization, which includes normalizing capitalization and diacritical marks (for example, the German umlaut).
- Analyzing the structure of paragraphs, sentences, words, and white space. Through linguistic analysis, the parser decomposes compound words and assigns tokens that enable dictionary and synonym lookup.
Apply document processing rules that you specify for the collection. For example, you can configure facets for exploring content and apply enterprise-specific text analytics through custom annotators and dictionaries.
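
The first few tasks above (conversion to Unicode, text extraction from markup, and character normalization) can be illustrated with a minimal Python sketch. This is not the product's implementation: the function names (to_unicode, extract_text, normalize, tokenize) are hypothetical, encoding detection is approximated by trying a short list of candidate encodings, and the tokenizer is a plain word split with no compound decomposition or dictionary lookup.

```python
import re
import unicodedata
from html.parser import HTMLParser


def to_unicode(raw, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding in turn (a stand-in for real detection)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: Latin-1 maps every byte to a code point, so it cannot fail.
    return raw.decode("latin-1")


class TextExtractor(HTMLParser):
    """Collects the character data found between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(markup):
    """Return the text content of HTML markup with white space collapsed."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return " ".join(" ".join(extractor.parts).split())


def normalize(text):
    """Fold case, then drop combining marks such as the umlaut diaeresis."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split normalized text into word tokens (assumed: simple \\w+ runs)."""
    return re.findall(r"\w+", text)


# Sample document stored in a legacy single-byte encoding.
raw = "<html><body><p>Über Größe</p></body></html>".encode("cp1252")
tokens = tokenize(normalize(extract_text(to_unicode(raw))))
print(tokens)  # ['uber', 'grosse']
```

Stripping the diaeresis outright is only one possible normalization policy. A production parser might instead map "ü" to "ue" or keep both forms as tokens so that either spelling matches at query time.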