Pattern Matcher annotator

The Pattern Matcher annotator captures patterns that are constructed from one or more words in the input text. The text is mapped to predefined facets for the parts of speech, such as nouns and verbs, and phrase patterns, such as a noun sequence.

The Pattern Matcher annotator can be used with content analytics collections only.

In the administration console, an administrator can configure rules for the patterns that are to be extracted and analyzed, and associate the rules with facets. When the annotator runs, it uses the rules to extract the defined patterns of text. Pattern matching during text analysis is case-sensitive.

If you use IBM® Content Analyzer and have user-defined pattern definitions (rule files) that you use with Pattern Matcher, you can use the pattern definitions with Watson Explorer Content Analytics if both of the following conditions are met:

The new facet path structure is the same as the existing category path structure.
The appropriate language is set for the definition files.

A pattern is a sequence of one or more words, each of which can have constraints. The following constraints are available:

Table 1. Constraints in pattern matching
Constraint	Description	Example
str	Surface string (the exact characters that appear in the input text)	ate
lex	Lemma of the word	eat
pos	The part of speech that the word represents	noun
ftrs	Additional features (attributes) of the words	proper
category	The facet path assigned by the Dictionary Lookup annotator	$.myword
guard	If an element is set as a guard, it matches either a word that meets its other constraints (as usual) or the beginning or end of the sentence. For example, to capture a sequence of exactly two nouns, the pattern is `"!noun" "noun" "noun" "!noun"`. Without guards, this pattern produces no match when the two nouns appear at the beginning of the sentence, because no word precedes them for the first element to match. Set `guard="true"` for the first and the last elements to guard the inner two nouns, which are the ones that you want. The default value is `guard="false"`.	guard="true"
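
The guard behavior in Table 1 can be illustrated with a small simulation. The following Python sketch is not the annotator's implementation or its rule syntax. The `Token`, `Element`, and `find_matches` names are hypothetical, only the `pos` constraint is modeled, and the sketch assumes that words matched by guard elements are left out of the captured result.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for one analyzed word."""
    surface: str  # surface string, as in the str constraint
    pos: str      # part of speech, as in the pos constraint


@dataclass(frozen=True)
class Element:
    """Hypothetical pattern element that constrains only the part of speech."""
    pos: str
    negate: bool = False  # True models "!noun": any word that is not a noun
    guard: bool = False   # True lets the element match a sentence boundary

    def matches(self, token: Optional[Token]) -> bool:
        if token is None:  # None marks the beginning or end of the sentence
            return self.guard
        hit = token.pos == self.pos
        return (not hit) if self.negate else hit


def find_matches(tokens: List[Token], pattern: List[Element]) -> List[List[str]]:
    """Slide the pattern over the sentence, padded with boundary markers."""
    padded: List[Optional[Token]] = [None, *tokens, None]
    results = []
    for start in range(len(padded) - len(pattern) + 1):
        window = padded[start:start + len(pattern)]
        if all(el.matches(tok) for el, tok in zip(pattern, window)):
            # Assumption: only words matched by non-guard elements are captured.
            results.append([tok.surface for el, tok in zip(pattern, window)
                            if not el.guard])
    return results


# "Coffee beans smell wonderful": two nouns at the beginning of the sentence.
sentence = [Token("Coffee", "noun"), Token("beans", "noun"),
            Token("smell", "verb"), Token("wonderful", "adj")]

unguarded = [Element("noun", negate=True), Element("noun"),
             Element("noun"), Element("noun", negate=True)]
guarded = [Element("noun", negate=True, guard=True), Element("noun"),
           Element("noun"), Element("noun", negate=True, guard=True)]

print(find_matches(sentence, unguarded))  # []
print(find_matches(sentence, guarded))    # [['Coffee', 'beans']]
```

In this sketch the unguarded pattern fails because its first element has no word to match before "Coffee", while the guarded pattern accepts the sentence boundary in that position. A run of three nouns is still rejected in both cases, because the last element requires a word that is not a noun or, when guarded, the end of the sentence.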

Content analytics collections have predefined pattern definitions to provide default text analytics capability. The following facets are defined by default for part-of-speech analysis. Part-of-speech analysis is provided for all languages.

Table 2. Facets for the parts of speech
Facet path	Facet name
$._word.noun.general	General Noun
$._word.noun.unk	Unknown
$._word.verb	Verb
$._word.adj	Adjective
$._word.adv	Adverb
$._word.conj	Conjunction
$._word.intj	Interjection
$._word.num	Numeral

The following facets are defined by default for phrase analysis. Phrase analysis is not the same for all languages. For example, some facets are not used for some languages.

Table 3. Facets for phrase analysis
Facet path	Facet name
$._phrase.noun_phrase.nouns	Noun Sequence
$._phrase.noun_phrase.mod_noun	Modified Noun
$._phrase.noun_phrase.adp_noun	Preposition Noun
$._phrase.pred_phrase.adv_pred	Predicate with Adverb
$._phrase.pred_phrase.noun_pred	Noun - Predicate
$._phrase.pred_phrase.verb_noun	Verb - Noun
$._phrase.conj_phrase.resultative	Resultative Conjunction
$._phrase.conj_phrase.contradictory	Contradictory Conjunction