Character Pattern annotator
The Character Pattern annotator uses regular expressions to extract tokens from the text and then assign facets to the tokens.
You can create Java™ regular expressions that will assign facets to text strings that match the regular expressions.
In the New Pattern text box, enter a regular expression. You can also click Examples and then click one of the sample regular expressions to copy it into the text box. Click Add to open the New Pattern dialog and update the following fields.
- Regular Expression
- The regular expression string.
- Name
- The name of this pattern.
- Description
- A description (optional).
- Facet
- The facet path. Text that matches this regular expression is included in this facet. To specify a hierarchy, use period (.) separated syntax such as <aaa>.<bbb>.
- Facet Value
- The facet value. You can create the facet value from the matched text by using the template values $0 ... $n. For example, if the regular expression is "(123)-(345)" and the text is 123-345, then $0 = 123-345, $1 = 123, and $2 = 345. $0 is the same as the matched text and is the default value. You can combine template values with literal text. For example, if you specify "value:$0", the facet value is "value:123-345". You can also specify literal text only.
Click Save to add the pattern to the list of patterns. To test patterns, click Test, add a test string, and then click Add. Repeat this operation to add additional test strings. Click Test it to apply the pattern in the New Pattern text box to the test strings. If the pattern is matched, the matched text is highlighted in green and the facet value is displayed.
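The relationship between the regular expression, the matched text, and the $0 ... $n template values can be tried outside the product with plain Java. The following is a minimal sketch, not the annotator's implementation: the class FacetValueSketch and the method facetValue are hypothetical names, and the sketch assumes that the Facet Value template behaves like Java's own $n replacement syntax.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it is not part of the annotator.
class FacetValueSketch {

    // Expands a template such as "value:$0" against the first match of regex
    // in text. Returns null when the text does not match.
    // Assumption: the template follows Java's $n replacement syntax.
    static String facetValue(String regex, String template, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return null;
        }
        StringBuffer sb = new StringBuffer();
        // appendReplacement copies the text before the match and then the
        // expanded template, so drop the first m.start() characters.
        m.appendReplacement(sb, template);
        return sb.substring(m.start());
    }

    public static void main(String[] args) {
        String regex = "(123)-(345)";
        String text = "123-345";
        System.out.println(facetValue(regex, "$0", text));       // whole match
        System.out.println(facetValue(regex, "$1", text));       // first group
        System.out.println(facetValue(regex, "$2", text));       // second group
        System.out.println(facetValue(regex, "value:$0", text)); // literal text plus match
    }
}
```

Running the sketch against a few of your own test strings before you enter a pattern in the dialog can help confirm which group number holds the text that you want in the facet value.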
Performance Tips
Regular expressions are compiled to Java Pattern objects. In general, keep the following considerations in mind.
- Keep the regular expressions manageable and understandable.
- Create regular expressions that lead to a match or a non-match quickly.
- Catastrophic backtracking
- Catastrophic backtracking usually occurs when the regex engine can determine that the text does not match only near the end of the string, and it tries too many permutations before it gives up. Consider the pattern (a+b*)+c and the text aaaaaaaaaaaaaaaaaaaaaaaaad. The text does not match the pattern, but the regex engine backtracks and the matching runs excessively slowly. Possessive quantifiers can be used to disallow backtracking, for example (a+b*)++c.
- Alternation
- Try to extract common patterns. For example, use a(b|c|d) instead of (ab|ac|ad).
- Capturing Groups
- Capturing groups can be used in patterns, and the results can be referenced with $n in facetValue, where n is the order of the capturing group. $0 stands for the text matched by the entire pattern. However, capturing groups incur a performance penalty. Always use non-capturing groups if you do not need to capture. For example, use (?:X) instead of (X).
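The tips above can be checked with the standard java.util.regex classes. The following is a minimal sketch under stated assumptions: the class PatternTipsSketch and its members are hypothetical names used only for illustration, and the sketch demonstrates behavior rather than measuring speed. It applies the possessive pattern to the problem text, and compares the factored, non-capturing alternation with the unfactored, capturing form.

```java
import java.util.regex.Pattern;

// Hypothetical demonstration class; it is not part of the annotator.
class PatternTipsSketch {

    // Possessive outer quantifier (++): once the group has consumed input,
    // the engine does not backtrack into it.
    static final Pattern POSSESSIVE = Pattern.compile("(a+b*)++c");

    // Common prefix "a" extracted, and a non-capturing group (?:...).
    static final Pattern FACTORED = Pattern.compile("a(?:b|c|d)");

    // Same language, but unfactored and with a capturing group.
    static final Pattern UNFACTORED = Pattern.compile("(ab|ac|ad)");

    // True if the pattern matches anywhere in the text.
    static boolean matches(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "aaaaaaaaaaaaaaaaaaaaaaaaad";

        // The text contains no "c", so the possessive pattern reports a
        // non-match without retrying shorter runs of "a".
        System.out.println(matches(POSSESSIVE, text));

        // Both alternation forms accept the same strings.
        System.out.println(matches(FACTORED, "xad") == matches(UNFACTORED, "xad"));

        // The non-capturing form defines no groups, so only $0 is available.
        System.out.println(FACTORED.matcher("").groupCount());
        System.out.println(UNFACTORED.matcher("").groupCount());
    }
}
```

Because a possessive quantifier never gives back input, confirm that the possessive version of a pattern still matches every string that you expect it to match before you replace the original.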