The parsers, document processors, and annotators
analyze documents that were collected by a crawler and prepare them
for indexing.
The results of the analysis are stored in a data store for access
by the indexing component. Document processing includes the following
tasks:
- Detect the character set encoding of each document. Before doing
any linguistic analysis, the parser uses this information to convert
all text to Unicode.
- Detect the source language of each document.
- Extract text from each document, regardless of its format.
For example, the parser extracts the text enclosed in XML and HTML
tags. The text extractor extracts text from binary formats such
as Microsoft Word and Adobe Portable Document Format (PDF) documents.
- Extract text and add tokens to enhance data retrieval and mining.
During this phase, the parser performs the following tasks:
  - Character normalization, such as normalizing capitalization and
  diacritical marks (for example, the German umlaut).
  - Analysis of the structure of paragraphs, sentences, words, and white
  space. Through linguistic analysis, the parser decomposes compound
  words and assigns tokens that enable dictionary and synonym lookup.
- Apply document processing rules that you specify for the collection.
For example, you can configure facets for exploring content and apply
enterprise-specific text analytics through custom annotators and dictionaries.
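
The first few tasks in the list above (conversion to Unicode, text extraction from markup, and character normalization) can be illustrated with a minimal Python sketch. This is not the product's parser. The function names, the list of candidate encodings, and the choice to strip diacritics rather than expand them are all assumptions made for illustration, and the sketch uses only the Python standard library.

```python
import unicodedata
from html.parser import HTMLParser


def to_unicode(raw, candidates=("utf-8", "cp1252")):
    """Decode bytes by trying candidate encodings in order.

    A stand-in for real character set detection (assumption): the
    first encoding that decodes without error wins.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value, so this fallback cannot fail.
    return raw.decode("latin-1")


class TextExtractor(HTMLParser):
    """Collect the character data found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def normalize(text):
    """Fold case and strip diacritical marks (for example, 'Ü' becomes 'u')."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split normalized text on white space (no compound decomposition)."""
    return normalize(text).split()


raw = "<p>Über die <b>Bücher</b></p>".encode("cp1252")
print(tokenize(extract_text(to_unicode(raw))))  # ['uber', 'die', 'bucher']
```

The sample bytes are not valid UTF-8, so the sketch falls through to the second candidate encoding before extracting and normalizing the text. A production parser would typically use statistical encoding and language detection and a linguistically aware tokenizer. Whitespace splitting does not decompose compound words or assign tokens for dictionary and synonym lookup.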