Manually configuring character rules

You can manually configure character rules that are based on regular expressions to define patterns of characters that make up a particular word or token. For each pattern, you specify the UIMA annotation to create when that pattern is found in a document.

About this task

You manually configure character rules in CHARRULES files. A character rules file contains one or more of the following statements. Each statement ends with a semicolon.

Assignment statements

Assignment statements assign a regular expression to a variable. The variable can then be used in other assignment statements or in match statements. Assignment statements simplify complex regular expressions by breaking them down into smaller sections. They are also useful if the same regular expression is used multiple times in the same character rules file. An assignment statement has the following syntax:

variable = regex;

The variable name must start with a dollar sign ($), followed by a letter or an underscore, and then by any number of letters, digits, or underscores. For example, $A, $A_Variable, and $var2 are valid variable names.
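
The naming rule can be checked mechanically. The following Python sketch is an illustration only and is not part of the product. It assumes that the name may contain only ASCII letters, digits, and underscores after the $ character; the function name is_valid_variable_name is hypothetical.

```python
import re

# Assumption: only ASCII letters, digits, and underscores may follow the "$".
# The product might accept a wider range of characters.
VARIABLE_NAME = re.compile(r"\$[A-Za-z_][A-Za-z0-9_]*")


def is_valid_variable_name(name: str) -> bool:
    """Return True if the whole name follows the naming rule."""
    return VARIABLE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    # The first three names are the valid examples from the text.
    for candidate in ["$A", "$A_Variable", "$var2", "$2var", "var"]:
        print(candidate, is_valid_variable_name(candidate))
```

Under these assumptions, a name such as $2var is rejected because a digit directly follows the $ character, and var is rejected because the $ character is missing.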

regex is any valid regular expression. The regular expression can include variables. The format for the regular expressions is similar to that of Java™ regular expressions, with the following main differences:

White space is ignored. To match a white space character, you must escape it.
Character class shortcuts such as \d are not supported. Instead, specify \p{Digit}.
The syntax for repetition is {*m,n} instead of {m,n}.
The operators & and ~ are used for intersection and difference of regular languages.
Lazy operators and back references are not supported.
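
Intersection and difference of regular languages have no direct counterpart in Java regular expressions. The following Python sketch is an illustration only and models just the set semantics, not the CHARRULES operator syntax: a token is in the intersection of two languages if both patterns match the whole token, and in the difference if the first pattern matches and the second does not. The helper names and the two sample patterns are hypothetical.

```python
import re


def in_language(pattern: str, token: str) -> bool:
    """Return True if the pattern matches the whole token."""
    return re.fullmatch(pattern, token) is not None


def in_intersection(a: str, b: str, token: str) -> bool:
    """The token belongs to both regular languages."""
    return in_language(a, token) and in_language(b, token)


def in_difference(a: str, b: str, token: str) -> bool:
    """The token belongs to language a but not to language b."""
    return in_language(a, token) and not in_language(b, token)


# Hypothetical sample languages, chosen only for illustration.
LETTERS = r"[A-Za-z]+"
UPPER_OR_DIGITS = r"[A-Z0-9]+"

if __name__ == "__main__":
    print(in_intersection(LETTERS, UPPER_OR_DIGITS, "ABC"))  # both match
    print(in_difference(LETTERS, UPPER_OR_DIGITS, "abc"))    # only LETTERS matches
```

In this sketch, the intersection of the two sample languages contains tokens that consist of uppercase letters only, and the difference contains letter-only tokens that include at least one lowercase letter.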

Match statements

Match statements define a regular expression for matching text in a document. A match statement also specifies the type of annotation to create over the text that matches the regular expression. A match statement has the following syntax:

regex {anno:"UIMA_Type"};

regex is any valid regular expression and can include variables.

UIMA_Type is the name of the UIMA Type to create over the matching text, and must be a valid UIMA type name.

Restriction: Unlike Java, Perl, and other regular expression engines, character rules operate only on token boundaries. Matches that do not fully cover a token are not annotated. For example, a character rule that identifies currencies does not match $3 in the phrase $3m if 3m is a single token.
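
The difference between token-based matching and ordinary substring search can be sketched in Python. This is an illustration only. It assumes a simple white space tokenizer, which is not how the product tokenizes text, and a hypothetical currency pattern.

```python
import re

# Hypothetical currency pattern: a dollar sign followed by digits.
CURRENCY = re.compile(r"\$[0-9]+")


def annotate_tokens(text: str) -> list[str]:
    """Return the tokens that the pattern covers completely.

    Assumption: tokens are separated by white space. The real tokenizer
    in the product is more sophisticated.
    """
    return [token for token in text.split() if CURRENCY.fullmatch(token)]


def substring_matches(text: str) -> list[str]:
    """Return matches anywhere in the text, as a typical regex engine would."""
    return CURRENCY.findall(text)


if __name__ == "__main__":
    print(annotate_tokens("It costs $3 today"))     # the token $3 is fully covered
    print(annotate_tokens("It costs $3m today"))    # no token is fully covered
    print(substring_matches("It costs $3m today"))  # substring search still finds $3
```

The substring search finds $3 inside $3m, but the token-based function does not, because the pattern does not cover the whole token.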

Lines beginning with a number sign (#) character are treated as comments.

The following example shows a simple character rules file that identifies Roman numerals. First, variables such as $Units and $Tens are defined. Then, these variables are used in match statements, each of which creates an annotation of type com.ibm.studio.userguide.RomanNumeral.

# Character rules to identify Roman numerals.
# These rules support only uppercase letters.

$Units     = I|II|III|IV|V|VI|VII|VIII|IX;
$Tens      = X|XX|XXX|XL|L|LX|LXX|LXXX|XC;
$Hundreds  = C|CC|CCC|CD|D|DC|DCC|DCCC|CM;
$Thousands = M|MM|MMM;

$Units {anno:"com.ibm.studio.userguide.RomanNumeral"};
$Tens ($Units)? {anno:"com.ibm.studio.userguide.RomanNumeral"};
$Hundreds ($Tens)? ($Units)? {anno:"com.ibm.studio.userguide.RomanNumeral"};
$Thousands ($Hundreds)? ($Tens)? ($Units)? {anno:"com.ibm.studio.userguide.RomanNumeral"};
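
To see which tokens these rules accept, the same patterns can be approximated with Python regular expressions. This sketch is an illustration only and is not how the product evaluates character rules. The four Python constants mirror the variables in the example, the four entries in RULES mirror the four match statements, and the function name is_roman_numeral is hypothetical.

```python
import re

# Counterparts of the $Units, $Tens, $Hundreds, and $Thousands variables.
UNITS = r"(?:I|II|III|IV|V|VI|VII|VIII|IX)"
TENS = r"(?:X|XX|XXX|XL|L|LX|LXX|LXXX|XC)"
HUNDREDS = r"(?:C|CC|CCC|CD|D|DC|DCC|DCCC|CM)"
THOUSANDS = r"(?:M|MM|MMM)"

# Counterparts of the four match statements, in the same order.
RULES = [
    UNITS,
    TENS + UNITS + "?",
    HUNDREDS + TENS + "?" + UNITS + "?",
    THOUSANDS + HUNDREDS + "?" + TENS + "?" + UNITS + "?",
]

# A token is accepted if any one rule covers it completely.
ROMAN = re.compile("|".join(f"(?:{rule})" for rule in RULES))


def is_roman_numeral(token: str) -> bool:
    """Return True if any of the four rules matches the whole token."""
    return ROMAN.fullmatch(token) is not None


if __name__ == "__main__":
    for token in ["XIV", "MCMXCIV", "IIII", "xiv", "MIX"]:
        print(token, is_roman_numeral(token))
```

Because the rules are purely character-based, an ordinary word such as MIX is also accepted (M followed by IX), and lowercase numerals such as xiv are rejected.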

Procedure

To manually configure character rules:

1. In the Studio Explorer view, right-click the Resources/Character Rules directory in your project and click New > Character Rules File.
2. Define statements in the CHARRULES file to create character rules.
3. After you save the file, build the character rules dictionary by clicking the Build icon.
4. Include the character rules dictionary file in your UIMA pipeline configuration:
   a. From the Configuration/Annotators directory, open the ANNOCONFIG file for your pipeline.
   b. Select the Lexical Analysis stage, select the appropriate language, and add the new character rules DIC file to the list of dictionaries.
5. Test the rule on a sample document:
   a. Annotate a document that contains text that matches the specified character sequence of the rule.
      From the Documents directory, open a document. Right-click the document in the editor view and click Analyze Document. Ensure that you select the UIMA pipeline to which you added the new character rules file.
   b. In the Outline view for the annotated document, review the annotations.
      If the rules did not identify all instances of the character sequence in the document, edit the rules in the CHARRULES file.

What to do next

Whenever you update the character rules, you must save the updated character rules file and rebuild the character rules dictionary so that your pipeline uses the updated rules to analyze documents.