Manually configuring character rules
You can manually configure character rules that are based on regular expressions to define patterns of characters that make up a particular word or token. For each pattern, you specify the UIMA annotation to create when that pattern is found in a document.
About this task
You manually configure character rules in CHARRULES files. A character rules file contains one or more of the following statements. Each statement ends with a semicolon.
- Assignment statements
Assignment statements assign a regular expression to a variable. The variable can then be used in other assignment statements or in match statements. Assignment statements simplify complex regular expressions by breaking them down into smaller sections. They are also useful if the same regular expression is used multiple times in the same character rules file. An assignment statement has the following syntax:
variable = regex;
The variable name must start with a $ character, followed by a letter or an underscore, and then followed by any number of letters, digits, or underscores. For example, $A, $A_Variable, and $var2 are valid variable names.
regex is any valid regular expression. The regular expression can include variables. The format for the regular expressions is similar to that of Java™ regular expressions, with the following main differences:
- White space is ignored. To match a white space character, you must escape it.
- Character class shortcuts such as \d are not supported. Instead, specify \p{Digit}.
- The syntax for repetition is {*m,n} instead of {m,n}.
- The operators & and ~ are used for intersection and difference of regular languages.
- Lazy operators and back references are not supported.
- Match statements
Match statements define a regular expression for matching text in a document. A match statement also specifies the type of annotation to create over the text that matches the regular expression. A match statement has the following syntax:
regex {anno:"UIMA_Type"};
regex is any valid regular expression and can include variables.
UIMA_Type is the name of the UIMA Type to create over the matching text, and must be a valid UIMA type name.
Restriction: Unlike Java, Perl, and other regular expression engines, character rules operate only on token boundaries. Matches that do not fully cover a token are not annotated. For example, a character rule that identifies currencies does not match $3 in the phrase $3m if 3m is a single token.
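The token-boundary restriction can be illustrated outside the product with a minimal Python sketch. It is not the character rules engine: the currency pattern is hypothetical, white-space splitting stands in for the real tokenizer, and the whole string $3m is treated as one token for simplicity. The point is only the contrast between a match that must cover a complete token and an ordinary substring search.

```python
import re

# Hypothetical currency pattern: a dollar sign followed by one or more digits.
CURRENCY = re.compile(r"\$[0-9]+")


def annotate_tokens(text):
    """Return the tokens that the pattern covers completely.

    Assumption: splitting on white space stands in for the product's
    tokenizer, which is not described in this topic.
    """
    return [token for token in text.split() if CURRENCY.fullmatch(token)]


def substring_hits(text):
    """Return every substring hit, as a conventional regex search would."""
    return CURRENCY.findall(text)
```

With the input "It costs $3 or $3m", annotate_tokens keeps only the token $3, whereas substring_hits also reports the $3 inside $3m. The second behavior is what the restriction rules out.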
Lines beginning with a number sign (#) character are treated as comments.
The following example shows a simple character rules file to identify Roman numerals. First, variables such as $Units and $Tens are defined. Then, these variables are used in match statements that each create an annotation of type com.ibm.studio.userguide.RomanNumeral.
# Character rules to identify Roman numerals.
# These rules support only uppercase letters.
$Units = I|II|III|IV|V|VI|VII|VIII|IX;
$Tens = X|XX|XXX|XL|L|LX|LXX|LXXX|XC;
$Hundreds = C|CC|CCC|CD|D|DC|DCC|DCCC|CM;
$Thousands = M|MM|MMM;
$Units {anno:"com.ibm.studio.userguide.RomanNumeral"};
$Tens ($Units)? {anno:"com.ibm.studio.userguide.RomanNumeral"};
$Hundreds ($Tens)? ($Units)? {anno:"com.ibm.studio.userguide.RomanNumeral"};
$Thousands ($Hundreds)? ($Tens)? ($Units)? {anno:"com.ibm.studio.userguide.RomanNumeral"};
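To see which tokens these rules are intended to accept, the following Python sketch mimics the example: it substitutes each variable into the match statements, drops the unescaped white space, and requires a pattern to cover the whole token. It is an approximation for experimenting with the patterns, not the character rules compiler. The names VARIABLES, MATCH_RULES, expand, and is_roman_numeral are made up for this illustration, the {anno:...} clause is omitted, and escaped white space is not handled.

```python
import re

# Assignment statements from the example, as a name-to-pattern mapping.
VARIABLES = {
    "$Units": "I|II|III|IV|V|VI|VII|VIII|IX",
    "$Tens": "X|XX|XXX|XL|L|LX|LXX|LXXX|XC",
    "$Hundreds": "C|CC|CCC|CD|D|DC|DCC|DCCC|CM",
    "$Thousands": "M|MM|MMM",
}

# The regex part of each match statement, without the {anno:...} clause.
MATCH_RULES = [
    "$Units",
    "$Tens ($Units)?",
    "$Hundreds ($Tens)? ($Units)?",
    "$Thousands ($Hundreds)? ($Tens)? ($Units)?",
]

# Assumption: variable names use ASCII letters, digits, and underscores.
VARIABLE_REF = re.compile(r"\$[A-Za-z_][A-Za-z0-9_]*")


def expand(rule):
    """Substitute variables as grouped patterns, then drop white space."""
    expanded = VARIABLE_REF.sub(
        lambda match: "(?:" + VARIABLES[match.group(0)] + ")", rule
    )
    return "".join(expanded.split())


COMPILED_RULES = [re.compile(expand(rule)) for rule in MATCH_RULES]


def is_roman_numeral(token):
    """Return True if any rule covers the complete token."""
    return any(rule.fullmatch(token) for rule in COMPILED_RULES)
```

Under these assumptions, is_roman_numeral accepts tokens such as XIV and MCMXCIV and rejects IIII and the lowercase xiv, which is consistent with the comment that the rules support only uppercase letters.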
Procedure
To manually configure character rules:
What to do next
Whenever you update the character rules, you must save the updated character rules file and rebuild the character rules dictionary so that your pipeline uses the updated rules to analyze documents.