Character rules

Character rules identify sequences of characters that represent particular entities in your text, such as telephone numbers, email addresses, or product identifiers.

For example, you might want to identify United States telephone numbers such as 704-501-1500. However, because phone numbers can also be written in other ways, such as (704) 501-1500, writing a regular expression that finds all of the possible variations can be challenging.
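To illustrate the difficulty, the following minimal sketch uses Python's standard re module to match the two formats mentioned above. It is not Content Analytics Studio rule syntax; the pattern, the name US_PHONE, and the function find_phone_numbers are assumptions made for this example, and the pattern covers only a few separator styles.

```python
import re

# Illustrative pattern only; this is NOT Content Analytics Studio rule syntax.
# It accepts "704-501-1500", "704.501.1500", "704 501 1500", and "(704) 501-1500".
US_PHONE = re.compile(
    r"""
    (?<!\d)                        # not preceded by another digit
    (?:\(\d{3}\)\s?|\d{3}[-.\s])   # area code, with or without parentheses
    \d{3}[-.\s]                    # next three digits plus a separator
    \d{4}                          # last four digits
    (?!\d)                         # not followed by another digit
    """,
    re.VERBOSE,
)


def find_phone_numbers(text):
    """Return every substring of text that matches the illustrative pattern."""
    return [match.group(0) for match in US_PHONE.finditer(text)]


if __name__ == "__main__":
    print(find_phone_numbers("Call 704-501-1500 or (704) 501-1500."))
```

Even this short pattern needs two alternatives for the area code and lookarounds to avoid matching inside longer digit strings. Each additional format, such as a leading country code or an extension, makes the expression harder to read and maintain.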

Character rules graphical editor

By using the Content Analytics Studio character rules editor, you can generate these character rule expressions graphically by basing the rules on sample text that contains the character sequences. After Content Analytics Studio analyzes the sample text, the pattern of character classes that represents the selected text is displayed in a tree format. You can then modify the pattern to match similar sequences of characters and define one or more annotations to create when matching text is found in the document. You can also create features for the annotations. For example, for a United States phone number annotation, you might create a feature for the area code, which is the first three digits of the telephone number.
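As a rough way to picture a pattern of character classes, the following sketch reduces a sample string to runs of digits, letters, spaces, and punctuation, and then extracts an area code value from a phone number. It is only an approximation for illustration: the class names, the functions char_class, class_pattern, and area_code_feature, and the flat list output are assumptions, and they do not reproduce how Content Analytics Studio analyzes sample text or builds its tree.

```python
import re
from itertools import groupby


def char_class(ch):
    """Assign a character to a coarse, hypothetical character class."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    if ch.isspace():
        return "space"
    return "punct"


def class_pattern(sample):
    """Collapse the sample text into (class, run length) pairs."""
    return [(cls, len(list(run))) for cls, run in groupby(sample, key=char_class)]


def area_code_feature(phone):
    """Return the first run of three digits, treated here as the area code."""
    match = re.search(r"\d{3}", phone)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(class_pattern("704-501-1500"))
    print(area_code_feature("(704) 501-1500"))
```

For the sample 704-501-1500, the pattern is three digits, one punctuation character, three digits, one punctuation character, and four digits. Relaxing the punctuation positions so that they also accept a space or a period is the kind of modification that lets one pattern match similar sequences.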

The character rules are stored in a character rules database. This database is then built into a character rules dictionary file that can be used in the lexical analysis stage of a UIMA pipeline to analyze text and annotate items of interest. Each character rules database typically contains a collection of character rules that are specific to a particular type of character sequence to be matched, such as telephone numbers in their various formats.
Tips: To help with maintenance of the rules:
  • Group similar character rules in the same database.
  • Keep individual rules simple, because groups of simple character rules are easier to manage and maintain. If the character sequence tree for a rule becomes too complex to manipulate, consider creating a separate character rule in the same database to handle alternative cases. You can define multiple rules that create annotations with the same name.

Character rules files

Alternatively, you can manually configure character rules by defining variables and regular expressions in CHARRULES files. This approach is for experienced users who want to create complex character rules.