Lexical Analysis Streams

Watson™ Explorer Engine version and greater includes the ability to incorporate lexical analysis language streams, which provide improved language support in Watson Explorer. Lexical analysis language streams are powered by language-specific PEAR files, which were developed using the analytical components of Watson Explorer. Pre-built PEAR files are included for 17 languages: Arabic, Chinese, Czech, Danish, Dutch, French, German, Greek, Hebrew, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, and Turkish.

PEAR files use lemmatization to process language streams. Lemmatization is similar to depluralization and stemming, but is more powerful because it also takes verb tenses into account.
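The difference between these three techniques can be sketched with a toy example. The code below is only an illustration: the word list and the functions `depluralize`, `stem`, and `lemmatize` are hypothetical, and it does not reflect how the PEAR files or Watson Explorer Engine are implemented. It shows why a dictionary-backed lemmatizer can map an irregular verb form such as "ran" to "run" when simple suffix stripping cannot.

```python
# Toy illustration only: this is NOT how Watson Explorer or its PEAR files work.

# Hypothetical, hand-written lemma dictionary (an assumption for this sketch).
LEMMAS = {
    "runs": "run",
    "ran": "run",
    "running": "run",
    "went": "go",
    "goes": "go",
    "mice": "mouse",
}


def depluralize(word: str) -> str:
    """Naive depluralization: strip a single trailing 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def stem(word: str) -> str:
    """Naive stemming: strip one common suffix, keeping at least 3 letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word: str) -> str:
    """Dictionary lookup: return the base form, or the word itself if unknown."""
    return LEMMAS.get(word, word)


def compare(words):
    """Return (word, depluralized, stemmed, lemmatized) for each input word."""
    return [(w, depluralize(w), stem(w), lemmatize(w)) for w in words]


if __name__ == "__main__":
    for row in compare(["runs", "running", "ran", "went", "mice"]):
        print(row)
```

In this sketch, only the lemmatizer relates "ran" and "went" to their base forms, because suffix rules alone carry no knowledge of verb tenses.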

Each of the pre-built PEAR files includes all the necessary components for Watson Explorer Engine to ingest, lemmatize, and search documents in the selected language. Lexical analysis streams support knowledge bases, stop words, term expansion dictionaries, wildcards, and regular expressions.

To add a lexical analysis stream to a search collection, see Adding Index Streams.

Notes on configuring your project for lexical analysis streams:

  • If your project contains one or more collections that use lexical analysis streams, the Main language should be set to custom on the project's Simple tab. See Language Settings for the list of settings.
  • By default, clustering will not take advantage of the language support provided by PEAR files, even if the underlying search collections use PEAR files via lexical analysis streams. To improve language support during clustering, use the Explicit stream configuration field on the project's Advanced > Clustering tab. See Clustering Settings for the list of settings.
  • Stemmers will not take advantage of the language support provided by PEAR files, so set the Stemmers field to none in the Clustering Settings. Also set Enable stem expansion to false in the project's Advanced > Metasearch tab, in the Query expansion section. See Query Modification Settings for the list of settings.
  • Each search collection supports a single lexical analysis language.
  • Phrase-breaking punctuation varies by language for lexical analysis streams. In general, lexical analysis streams treat double quotes as non-phrase-breaking.
  • Knowledge bases will be disabled for lexical analysis streams if the list of knowledge bases is equal to the default knowledge base list. The default knowledge base list is core+web+english+custom.
  • Changing the language of a collection will require a re-crawl of the collection's data. (Any change in a collection's indexer stream configuration will require a re-crawl; this is not specific to lexical analysis streams.)
  • Existing streams that do not use lexical analysis remain supported for backward compatibility.
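
The configuration notes above can be expressed as a simple checklist. The following is a minimal sketch, assuming the project settings have been copied by hand into a plain dictionary; the key names (`main_language`, `stemmers`, `enable_stem_expansion`, `explicit_stream_configuration`, `knowledge_bases`) and the function `check_project` are hypothetical and are not part of any Watson Explorer API. The exact string comparison against the default knowledge base list is also an assumption.

```python
# Hypothetical checklist for a project that uses lexical analysis streams.
# The dictionary keys are illustrative names, not real Watson Explorer settings keys.

DEFAULT_KB_LIST = "core+web+english+custom"  # default list quoted in the notes


def check_project(settings: dict) -> list:
    """Return a list of warnings based on the configuration notes."""
    warnings = []

    if settings.get("main_language") != "custom":
        warnings.append("Set Main language to custom on the Simple tab.")

    if settings.get("stemmers") != "none":
        warnings.append("Set Stemmers to none in the Clustering Settings.")

    if settings.get("enable_stem_expansion") is not False:
        warnings.append("Set Enable stem expansion to false (Advanced > Metasearch).")

    if not settings.get("explicit_stream_configuration"):
        warnings.append("Clustering will not use PEAR language support "
                        "without an Explicit stream configuration.")

    # Assumption: a missing value means the default list is in effect.
    if settings.get("knowledge_bases", DEFAULT_KB_LIST) == DEFAULT_KB_LIST:
        warnings.append("Knowledge bases are disabled while the list "
                        "equals the default list.")

    return warnings


if __name__ == "__main__":
    example = {
        "main_language": "custom",
        "stemmers": "none",
        "enable_stem_expansion": False,
        "explicit_stream_configuration": "my-stream",  # placeholder value
        "knowledge_bases": "core+web+custom",          # placeholder value
    }
    print(check_project(example))
```

An empty list means none of the notes above were flagged; the checklist does not cover the single-language-per-collection or re-crawl notes, which depend on collection state rather than project settings.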

If you have access to IBM® Watson Explorer Content Analytics Studio (which is included in the analytical components of Watson Explorer), you can modify the pre-built PEAR file project included with Watson Explorer, or create additional language PEAR files. See Creating Custom PEAR Files for Use with Lexical Analysis Streams. PEAR file customizations can include, among other things, changing how a word is lemmatized if the default behavior is not what you want. See Creating a Custom PEAR File with a Custom Dictionary.

See UIMA Software Development Kit for additional information about creating PEAR files.